IP CORE, ARCHITECTURE COMPRISING AN IP CORE AND AN IP CORE DESIGN PROCEDURE
Patent abstract:
IP core, architecture comprising an IP core, and procedure for designing an IP core. A configurable and programmable IP processing core for the computation of a plurality of matrix products, in which both the data to be processed and the results obtained are transferred serially. The IP core comprises: a data entry block for providing a set of vectors representing a first and a second matrix whose product is to be computed, wherein said data entry block comprises a first sub-block and a second sub-block; a memory block comprising N memory elements, each associated with a respective output of said second sub-block of the data entry block; a fixed-point matrix-vector multiplier block to implement a multiplication-accumulation operation; a block comprising at least one activation function configured to be applied to the output of said fixed-point matrix-vector multiplier block; a block for storing the outputs of the at least one activation function and for reading the outputs of said storage components; a FIFO block; and a data output block comprising a row counter and a column counter. System on chip comprising at least one IP core. FPGA comprising at least one IP core. Procedure for designing an IP core.
Publication number: ES2697693A1
Application number: ES201730963
Filing date: 2017-07-24
Publication date: 2019-01-25
Inventors: Unai Martinez Corral; Koldobika Basterrechea Oyarzabal
Applicant: Euskal Herriko Unibertsitatea
Patent description:
[0001] [0002] [0003] [0004] FIELD OF THE INVENTION [0005] [0006] The present invention pertains to the field of embedded data processing systems and, more specifically, to processing systems for the computation of neural networks, especially feedforward neural networks. The invention has special application in the design of data processing acceleration systems, such as those used in machine learning, on embedded platforms such as Systems on Chip (SoC) and FPGAs (Field Programmable Gate Arrays). [0007] [0008] BACKGROUND OF THE INVENTION [0009] [0010] Machine learning is a technological area that has experienced enormous development in recent years, mainly due to three factors: (1) the availability of huge amounts of data (due to the development of the Internet of Things or IoT, advances in sensor technology and the widespread use of digital video and audio, among others); (2) great hardware development (microelectronic technology) and the consequent increase in computing capacity; and (3) advances in the computational intelligence algorithms themselves (in this sense, the emergence of the Deep Learning concept and its application successes, especially in the field of artificial vision, have generated great interest in the use of neural networks in industrial applications). [0011] [0012] Among the different processing options for machine learning, artificial neural networks (ANNs) are one of the most popular predictor models (classifier, regressor) and are receiving increasing attention because of their wide applicability. However, neural processing applications (processing based on ANNs) for machine learning require a large computing capacity which, due to their fundamentally parallel and massively interconnected architecture, can only be satisfied by using specific processors of high performance and high efficiency. In the case of embedded processing systems (as opposed to mass processing systems based on cloud computing and large computers with high volume and consumption), whose areas of applicability are constantly growing (autonomous systems, mobile platforms, IoT, intelligent/autonomous automobiles, etc.), these computational demands represent a greater challenge due to the need to implement them in hardware of small size, low consumption and low cost. FPGAs are, in this sense, one of the platforms with the most potential in this field, since they allow applying the most advanced digital design techniques (parallelism, segmentation, specific design with fine granularity, both in logic and in memory) to the implementation of complex processing systems in a single chip, so that the highest performance can be obtained in terms of processing capacity per unit of power consumed. However, the design of this type of system is complex and laborious, with relatively long design cycles, which requires the involvement of expert designers and lengthens the time to market of products. Consequently, the current trend is to offer designers libraries of predesigned units, preferably configurable to their needs, in the form of intellectual property blocks (IP blocks or IP cores), often configurable and scalable, so that they can be integrated into their designs and adjusted to the needs of their final applications. The concept of an IP core is intimately linked to the concept of reusability and to the use of CAD (EDA) tools in the design and synthesis of digital systems. 
Ideally, IP cores are completely portable, that is, a standard description language (or a logical netlist format) has been used for their design and there is no associated information regarding the final implementation technology. In this way, designers of digital systems can make use of these IP cores, which are organized in libraries, integrating them directly into their designs as simple black boxes, often configurable by defining certain parameters, which only expose the input/output ports for their interconnection with the rest of the system, often through buses. [0013] [0014] As an example, within neural networks, convolutional neural networks (CNNs) and deep neural networks (DNNs) represent a computational model that is gaining popularity due to its potential to solve human-computer interface problems, such as the interpretation of images. This is because these networks can achieve great precision by emulating the behavior of the optic nerve. The kernel of the model is an algorithm that receives as input a large data set (for example, the pixels of an image) and applies to that data a set of transformations (convolutions in the case of CNNs) according to predefined functions. The transformed data can then be fed to a neural network to detect patterns. As in the general case of ANNs, due to the specific computational pattern of CNNs and DNNs, general-purpose processors are not efficient in implementations based on CNNs or DNNs. [0015] [0016] The patent application US 2015/0170021 A1 describes a processor device that includes a processor core and a number of calculation modules, each of which is configurable to perform operations of a CNN system. A first set of calculation modules of the processor device is configured to perform convolution operations, a second set of calculation modules is reconfigured to perform averaging operations and a third set of calculation modules is reconfigured to perform scalar product operations. However, it is a fixed-precision processor, selectable between 8 bits and 16 bits, and a different word size cannot be indicated for each stage of the circuit. As a result, it is not possible to optimize the computing precision of the processor, with the consequent negative impact on the resources consumed (occupied area). [0017] [0018] Tianshi Chen et al. have proposed an accelerator for machine learning (neural networks) consisting of an input buffer for input neurons, an output buffer for output neurons and a third buffer for synaptic weights, connected to a computational block configured to perform synaptic and neuronal computations. The accelerator also has control logic ("DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", SIGPLAN Not., vol. 49, no. 4, pp. 269-284, Feb. 2014). In this proposal, the memory elements disposed at the input/output of the kernel are buffers with DMA (a buffer for the input data, another buffer for the weights and a third buffer for storing the results) connected to a common memory interface, so absolute addresses are needed to access the external memory; that is, there is no memory hierarchy. As far as the arithmetic resources are concerned, they are arranged for maximum parallelization, using one or more adder trees. 
Also, the numerical format used is fixed and invariable (16-bit fixed point). Furthermore, according to the control logic, it is understood that the sizes of the different layers that make up the model must comply with certain restrictions, so that the choice of subdivision parameters yields integer results. [0019] In turn, in the proposal of J. Qiu et al. ("Going deeper with embedded FPGA platform for convolutional neural network", Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA'16, Monterey, California, USA: ACM, 2016, pp. 26-35, isbn: 978-1-4503-3856-1, doi: 10.1145/2847263.2847265), the design of the arithmetic unit is very similar to the one proposed by Tianshi Chen et al. in "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", with the difference that the buffer used to load the weights is of the FIFO type. [0020] [0021] In turn, Chen Zhang et al. have proposed an accelerator design based on FPGA for deep convolutional neural networks, in which they try to optimize both the logical resources and the memory bandwidth ("Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks", FPGA'15, February 22-24, 2015, Monterey, California, USA, ACM 978-1-4503-3315-3/15/02). The proposed accelerator is capable of executing acceleration work across different layers without the need to reprogram the FPGA. This disclosure focuses on algorithmic reordering and the optimal choice of parameters, taking into consideration the characteristics of the target platform. For the description of the system, this proposal uses the high-level synthesis tool HLS (Vivado). The use of a tool of this type, compared to a design carried out at a lower level (RTL), entails several limitations in terms of optimizing data communication (alternatives to the use of shared memories, non-sequential reads of the FIFOs, etc.) and, in particular, of the organization of and access to memory (management of independent addressing spaces, dynamic memory allocation, etc.). Also, the numeric format used is 32 bits (fixed or floating point), with no option to configure it. [0022] [0023] On the other hand, M. Motamedi et al. explore ("Design space exploration of FPGA-based deep convolutional neural networks", 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2016, pp. 575-580, doi: 10.1109/ASPDAC.2016.7428073) the possibilities of maximum parallelization and exploitation of locality, using modules called Parallel Convolution Engines (PCE), composed of multiple multipliers and an adder tree. This design is only useful in the computation of convolution layers. [0024] [0025] Finally, H. Li et al. propose ("A high performance FPGA-based accelerator for large-scale convolutional neural networks", 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Aug. 2016, pp. 1-9, doi: 10.1109/FPL.2016.7577308) [0026] the use of an instance of the arithmetic module for each layer of the neural network model, introducing double buffers between each stage. Regarding the design of each arithmetic module, a MACC systolic array is proposed, of length equal to the filter to be applied. The weights are loaded by multiplexers and the data through a shift register whose size is limited to the length of the convolution filter. The data is not used until the register is completely filled. 
They also use adder trees to accumulate partial results computed in parallel. [0027] [0028] DESCRIPTION OF THE INVENTION [0029] [0030] The present invention provides a processing module that solves the drawbacks of the previous proposals. [0031] [0032] In the context of the present disclosure, the terms "processor", "neural processor", "processing core", "IP core", "IP block" and "IP module" are used interchangeably. The present disclosure describes the IP core of a neural processor that is fully configurable by the end user, i.e., for instance by the designer of a SoC (System on Chip). This IP core allows the acceleration of computation and is especially applicable to the acceleration of computation in machine learning algorithms. Among the characteristics of the processor of the invention, the following can be highlighted: [0033] [0034] - The neural processor can adapt automatically, during the synthesis process of the neural network to be implemented, to the resources available in the selected target device (for example, an FPGA) by means of a folding technique and the reuse of neural layers. From the description sources, for example in VHDL, a synthesizer (CAD) generates a netlist suitable for the target platform or technology (depending on whether it is an FPGA or an ASIC). That is, a fully parameterized description of the neural processor (IP core) has been made in which the end user can indicate, among other things, how many resources he/she wants or is able to use for the final implementation of the processor, so that the system makes the adjustments necessary to fold the architecture of the network so that the synthesized processor is finally more serial (which implies greater reuse of fewer hardware resources, and is therefore somewhat slower), or more parallel (which implies less reuse, or none at all in the most extreme case, of a greater amount of hardware resources, and is therefore faster). This feature allows the processor (or IP core) to be embedded both in low-cost FPGAs (small size and consumption and reduced logical resources) and in high-performance FPGAs. Moreover, once a certain network size has been synthesized, the number of neurons used in each network layer can be selected at runtime, making the use of dynamic reconfiguration techniques unnecessary. This is achieved by activating/deactivating an enable signal present in each processing element (DSP and activation function). That is, in contrast to the automatic adaptation of the number of hardware resources (DSPs, etc.) that will be synthesized, it is possible to activate/deactivate neurons that have already been implemented ("programming" vs. "configuration" in the synthesis phase). To carry out this programming, the user does not need to manage all the enable signals individually; instead, a register is generated in the control module that allows indicating how many of the synthesized neurons are going to be used or activated in the last fold (since in all the previous folds, if there are any, all physical neurons are used). The system detects when the value of that register has been modified, and a counter is used along with a shift register to sequentially generate all the enable/disable signals. In a possible implementation, in which in most executions the number of physical neurons used is greater than the number of neurons disabled, initially all are enabled. 
Thus, the configuration latency is equal to the number of neurons disabled. The maximum latency, therefore, is one less than the number of physical neurons (when only one is used). In another possible implementation, another predefined pattern can be used: for example, half are enabled and the other half are not. In this way, the maximum latency is halved, but the average latency for optimal parameter choices (those in which the number of neurons in the layer model is a multiple of the number of physical neurons) is increased. [0035] [0036] - The processor (or IP core) uses virtual addresses, so that its internal components work in terms of matrices, columns and rows, which facilitates the development of applications based on algorithms that integrate matrix algebra operations. The neural processor (IP core) includes "bridge" modules to manage the IP core itself directly from external ports, such as AXI ports, which are the standard in some products (chips, ASICs), such as those of Xilinx and ARM. Furthermore, since the management of large volumes of data (i.e., large matrices) is critical in current systems, the processor has been provided with a configurable and programmable interconnection block (the MAI, which is described below) specifically designed to allow internal memory blocks to be managed as external ones. [0037] [0038] As a result, the designer of a SoC architecture can integrate multiple computing and storage blocks, and transparently perform performance tests by dynamically allocating memory, in order to obtain the most efficient solution. This is achieved by means of a set of tables and microblocks in the MAI that offer great variety and granularity when allocating not only the matrices, but each row/column. [0039] [0040] In addition, the processing module of the invention is based on the reuse of multiplication-accumulation (MACC) blocks, which in this text is sometimes referred to as layer folding. In this sense, the parameters of the layer model (sometimes called the 'weights' or 'gains' of the neural interconnections), which are adjusted in a previous training or learning phase of the network, are loaded into the scratchpads through the same port as the input data (network input vectors). In implementations of the invention, the control module can keep the scratchpad of the input vector on standby until all the parameters have been loaded, so that once the computation starts it is not interrupted by the loading of parameters. The final results can simply be saved in memory through the output port, or fed back to the scratchpad input. The choice depends on the specific layer model being implemented. For example, in more compact solutions where a slower execution time is acceptable, a vector can be fed back several times to process the effect of multiple layers using only local memory. [0041] [0042] Although the design of the processing module is independent of the target technology in which it is to be integrated, in embodiments of the invention the high degree of parameterization of the design description of the IP core and its enormous scalability allow the IP core to be integrated in different ways in different technologies. On the one hand, there are the already mentioned FPGAs and PSoCs, which are "prefabricated" devices with high configurability, so the IP core is designed to adapt optimally to the architectural characteristics of systems-on-chip (SoC) on FPGA, and more specifically to the characteristics of the AXI bus. 
Thus, the IP core can be embedded both in FPGAs of small size and low cost and in FPGAs of greater size and performance. On the other hand, the description code of the processor, written in the standard hardware description language VHDL (an acronym that results from combining VHSIC (Very High Speed Integrated Circuit) and HDL (Hardware Description Language)), provides great portability, so that the processing module is not only integrable in FPGA architectures (integrating it with soft processors) or in so-called PSoCs (Programmable SoCs), which contain a hard processor (such as the ARM Cortex-A9 integrated into a Zynq device from Xilinx), but is also integrable in an ASIC with SoC architecture together with other processors and acceleration modules. That is to say, although the preferred target technology for which the processing module (or processor) of the invention has been designed is an FPGA, since the description code of the processor has been developed with the aim of making optimized use of FPGA resources (such as memory blocks (BRAM), arithmetic units (DSP blocks) and clock managers/synthesizers/dividers (Mixed-Mode Clock Manager, MMCM)), the processing module of the invention can also be mapped to an ASIC for its integration in a SoC. Note that products that integrate FPGA IP blocks in ASICs (that is, configurable IP blocks adaptable to different manufacturing technologies), commonly known as eFPGAs, have recently been commercialized, so that manufacturers of SoCs in ASIC can integrate said IP blocks in their chips to provide a reconfigurable logic zone in the style of FPGAs (an FPGA integrated as part of an ASIC). [0043] [0044] A significant difference of the processor of the invention with respect to those disclosed in the state of the art is that the processor has been designed to optimize its operation in applications with streaming-type data input. For example, the processor is especially suitable for processing hyperspectral images. Consequently, the maximum parallelization of the processing has not been sought (a choice that conditions the use of adder trees and the broadcasting of parameters to multiple instances at the same time); rather, the innermost loop is sequential. This fact has an impact on the size of the system, as well as on the execution time and the maximum operating frequency. In comparison with the known designs, the present processor results in solutions that occupy less area (fewer resources in an FPGA) and require more clock cycles of computation, but in turn allow working at higher frequencies, which compensates to some extent for the greater latency in cycles. This is because the goal is to adapt to the limitations imposed by streaming and take advantage of it to get the most out of small-size FPGAs. In addition, the present proposal avoids the imposition of inviolable relationships between parameters, which results in greater flexibility and scalability. [0045] [0046] On the other hand, since the present IP core is specially designed for systems with limited resources (small devices), the IP core control system does not impose restrictions on the relationships between the configuration parameters. It is common, for example, for the most widespread (and complex) CNN networks to use powers of two to establish the sizes of the network layers and of the data sets to be processed. 
However, in applications with SLFNs, the need to choose layer sizes of up to 2^k with finer granularity has been detected. Thus, the autoconfiguration tools generate a solution that guarantees the shortest execution time of the architecture for any network size, without inserting padding data for its control. Because the enable signals are activated sequentially, the additional complexity in the architecture control elements is minimized. This is helped by the fact that the innermost loop (i.e., the vector product) is processed sequentially, thereby reducing by one the dimensions that must be managed non-linearly. At the same time, it must be considered that the omission of DMAs from the architecture and the use of optimized address formats compensate for the use of resources. [0047] [0048] With respect to the shift registers for data input, a significant difference of the processor of the invention with respect to those disclosed in the state of the art is that a large number of multipliers are used and, when computing in "wave" mode, the data starts to be used from the first element by the first multiplier, without waiting for the registers to be full. [0049] [0050] Another noteworthy advantage over other proposals is that the computation accuracy of the processor of the invention is adjustable, that is, configurable for each data set and, optionally, selectable at run time. In conventional proposals the precision is fixed, because it is assumed that the FPGA will be reconfigured for each neural network model (there is no configurability at runtime). [0051] [0052] On the other hand, several parameters of the present processor are configurable at runtime, that is, after synthesis. Some of these parameters are the size of the chunk of data to be processed, the number of inputs, the number of neurons in the hidden layers to be processed, the number of outputs and the use (or not) of the activation functions in each layer. This means that the processor, once synthesized and implemented, is more flexible and can adapt to different network models without the need to reconfigure the hardware. [0053] [0054] In addition, the present solution uses an interconnection module (MAI) between the external memory and the IP core, specifically designed to efficiently and easily connect several IP cores in a heterogeneous SoC-type architecture. In embodiments of the invention, this module (MAI) is based on Wishbone B4, ensuring its total portability and independence from the target technology. [0055] [0056] Another outstanding aspect of the present solution is that the IP core has been fully described using the standard VHDL language, which makes it agnostic to the design/synthesis tools and totally portable from a technological point of view, as well as allowing total control over each and every aspect of the design. Moreover, a set of packages has been written in this language that allows the practically total parameterization of the design, so that in reality a complete tool has been developed for designing and configuring the IP core that allows the system designer to use and integrate it easily. The set of scripts and user interfaces that make up these automatic parameterization tools are multiplatform (for example, Windows, Linux, Mac) and generate standard VHDL, so they can be used both for synthesis on FPGA and for semi-custom ASIC. [0057] [0058] Finally, the processing module described in the present invention has integrated memory management. 
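By way of illustration only, the sequential generation of the enable signals mentioned above can be modelled behaviourally as follows. This is a minimal Python sketch, not the VHDL implementation; all names are hypothetical, and it only reproduces the described behaviour in which one enable flag is updated per clock cycle, so that the configuration latency equals the number of physical neurons whose state must change.

```python
def enable_update(current, active_in_last_fold):
    """Behavioural sketch of the sequential (de)activation of physical neurons.

    'current' is the present enable vector (one flag per physical neuron) and
    'active_in_last_fold' is the value written into the control register.
    One flag is updated per simulated clock cycle (counter plus shift register),
    so the latency equals the number of flags whose state changes.
    """
    enables = list(current)
    cycles = 0
    for i in range(len(enables)):
        desired = i < active_in_last_fold
        if enables[i] != desired:
            enables[i] = desired
            cycles += 1  # one clock cycle per flag toggled
    return enables, cycles

# Example: 8 physical neurons, all enabled initially, 5 used in the last fold.
state, latency = enable_update([True] * 8, 5)
print(state, "latency:", latency)  # 3 cycles: neurons 5, 6 and 7 are disabled
```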
[0059] [0060] In a first aspect of the invention, a configurable and programmable IP processing core is provided for the computation of a plurality of matrix products, in which both the data to be processed and the results obtained are transferred serially, comprising: a data entry block configured to provide, from input data, a set of vectors representing a first and a second matrix whose product is to be computed, using a virtual address format composed of matrix, row and column pointers, wherein said data entry block comprises: a first sub-block configured to obtain a row pointer (pROW) and a column pointer (pCOL); and a second sub-block comprising N components, where N is a natural number > 1, each of which comprises two chained counters corresponding to the number of vectors to be transferred and the length of said vectors, where each component uses local addresses; a memory block comprising N memory elements, each of said memory elements being associated with a respective output of said second sub-block of the data entry block; a fixed-point matrix-vector multiplier block configured to implement a multiplication-accumulation operation to multiply a matrix by multiple vectors received serially and continuously, where said fixed-point matrix-vector multiplier block comprises a set of sub-blocks, where each of said sub-blocks comprises a plurality of arithmetic modules; a block comprising at least one activation function configured to be applied to the output of said fixed-point matrix-vector multiplier block; a block for storing in storage components the outputs of the at least one activation function and for reading the outputs of said storage components; and a data output block using a virtual address format composed of matrix, row and column pointers, comprising a row counter and a column counter. [0061] [0062] In embodiments of the invention, the first component of said second sub-block is configured to provide a number of vectors equal to the number of consecutive matrix-vector products that it is desired to compute. [0063] [0064] In embodiments of the invention, the second to last components of said second sub-block are configured to provide a number of vectors equal to the number of passes that must be made with the corresponding DSP. [0065] [0066] In embodiments of the invention, the fixed-point matrix-vector multiplier block is based on a linear systolic array with parallel parameter loading and wave-like execution. [0067] [0068] In embodiments of the invention, said N memory elements comprised in said memory block are N BRAM blocks. [0069] [0070] In embodiments of the invention, each sub-block or group of said fixed-point matrix-vector multiplier block comprises a multiplexer at its output. [0071] [0072] In embodiments of the invention, each sub-block or group of said fixed-point matrix-vector multiplier block comprises at its output as many shift registers as arithmetic modules each sub-block has. [0073] [0074] In embodiments of the invention, said arithmetic modules, operating in parallel, generate, every z cycles, as many data as there are arithmetic modules, where z is the length of the vector. [0075] [0076] In embodiments of the invention, the parallel execution of said arithmetic modules is controlled by a state machine that takes as reference only the first arithmetic module. In this case, the state machine can use three support counters: vector length, number of repetitions and latency of the arithmetic module. 
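Purely as an illustration of the virtual address format used by the data entry and data output blocks (matrix, row and column pointers), the following Python sketch packs and unpacks such an address. The field widths and their ordering are assumptions chosen for the example; in the IP core they are derived from the configuration parameters.

```python
# Hypothetical field widths; in the actual design they are derived from the
# configuration parameters (number of matrices, maximum rows and columns).
MAT_BITS, ROW_BITS, COL_BITS = 2, 10, 10

def pack(matrix, row, col):
    """Pack matrix/row/column pointers into a single virtual address word."""
    assert matrix < (1 << MAT_BITS) and row < (1 << ROW_BITS) and col < (1 << COL_BITS)
    return (matrix << (ROW_BITS + COL_BITS)) | (row << COL_BITS) | col

def unpack(addr):
    """Recover the matrix, row (pROW) and column (pCOL) pointers."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    matrix = addr >> (ROW_BITS + COL_BITS)
    return matrix, row, col

addr = pack(matrix=1, row=3, col=7)
print(hex(addr), unpack(addr))  # element (row 3, column 7) of matrix 1
```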
[0077] [0078] In embodiments of the invention, said fixed-point matrix-vector multiplier block represents at least one hidden layer of an artificial neural network. [0079] [0080] In embodiments of the invention, the IP core comprises means to feed back the output of said FIFO block in order to perform at least two consecutive matrix-vector operations with an intermediate filtering, such that with a single instance of the fixed-point matrix-vector multiplier block both the at least one hidden layer and the output layer of the neural network are processed. [0081] [0082] In embodiments of the invention, M arithmetic modules are used to perform h vector multiplications, where h is the number of rows of the matrix in the matrix-vector product, so that if h > M, more than one pass, iteration or repetition is required to process each input vector. [0083] [0084] In embodiments of the invention, the IP core further comprises an interconnection core configured to facilitate the integration of the IP core in a heterogeneous system with one or more coprocessor modules, said interconnection core being configured to manage memory blocks internal and external to the IP core. [0085] [0086] In embodiments of the invention, all operations are carried out in fixed point of adjustable precision, it being possible to define the word size used for each of the matrices that intervene in each execution. [0087] [0088] In a second aspect of the invention, a SoC (System on Chip) architecture is provided which incorporates a processing module as described above. In embodiments of the invention, the processing module is integrated into an FPGA. That is, a system on chip (SoC) comprising at least one IP core as described above is provided. [0089] [0090] In another aspect of the invention, an FPGA is provided comprising at least one IP core as described above. [0091] [0092] In another aspect of the invention, a method is provided for designing an IP core as described above, suitable for a target technology, comprising: generating a netlist comprising a parameterized description of the IP core suitable for said target technology; synthesizing a neural network that is to be implemented, adapting the IP core to the resources available in said target technology, where said adaptation is carried out by means of a folding technique and the reuse of neural layers; and, once a certain neural network size has been synthesized, selecting at runtime the number of neurons to be used in each network layer. [0093] [0094] The neural processor of the invention is applicable in any ANN-based machine learning application in which it is convenient to perform accelerated processing of large volumes of data and, more particularly, in embedded systems with requirements of small size and weight and great integrability. Consequently, the applicability of the neural processor of the invention is very wide: among others, classification of objects in embedded vision systems (tumor detection, detection and identification of objects, pedestrian detection, autonomous cars, drone guidance, target detection, hyperspectral image processing, etc.), since machine learning applications cross all vertical markets, from military/aerospace, through automotive, industrial and medical instrumentation, to large data processing centers (internet, cloud computing, IoT). 
The neural processor of the invention can also be used in deep learning applications, for example for embedded vision, since the processor can easily be configured to work as a neural network with multiple processing layers, even with convolutional or recurrent layers. [0095] [0096] In sum, among the advantages of the proposed IP core, it can be highlighted that it has been conceived and built to be easily integrated as an IP core in the design of SoC architectures together with other processing modules (including, of course, microprocessors). Thus, it is noteworthy that a SoC designer who wants to use it need only adjust a series of configuration parameters at a high level (aspects of the network model to be implemented as well as restrictions on the hardware resources available for the implementation) and the code autoconfigures itself to generate a processing structure suitable for those requirements (low-level adjustments). Moreover, the design allows some of its functionalities to be programmable at run time once the processor is implemented. [0097] [0098] Advantages and additional features of the invention will be apparent from the detailed description that follows and will be pointed out in particular in the appended claims. [0099] [0100] BRIEF DESCRIPTION OF THE FIGURES [0101] To complement the description and in order to help a better understanding of the characteristics of the invention, according to an example of practical realization thereof, a set of figures is included as an integral part of the description, in which, with illustrative and non-limiting character, the following has been represented: [0102] [0103] Figure 1 represents a conventional artificial neural network (ANN). Specifically, a typical architecture of an artificial neural network of the Single Hidden Layer Feedforward Network (SLFN) type is illustrated, as well as the matrix representation of the inference and training phases of the direct Extreme Learning Machine (ELM) model. [0104] [0105] Figure 2 illustrates a block diagram of an IP core according to a possible embodiment of the invention. [0106] [0107] Figure 3 illustrates a block diagram of a possible system in which the IP core of Figure 2 can be integrated. [0108] [0109] Figure 4 illustrates a block diagram of the bridge modules for connecting interconnection modules based on Wishbone B4 with interconnection modules based on AMBA, according to a possible embodiment of the invention. [0110] [0111] Figure 5 illustrates a possible implementation of several modules of the block diagram of Figure 2, according to the present invention. [0112] [0113] Figure 6 shows in detail the modifications of the FIFO reading interfaces corresponding to the input matrix and to the weight matrix, according to a possible embodiment of the invention. That is, the access patterns for the processing of three consecutive matrix-vector products in a four-pass (waves) problem are shown graphically. [0114] [0115] Figure 7 illustrates an example of implementation of an IP core according to a possible embodiment of the invention. [0116] [0117] Figure 8 illustrates a block diagram of an IP core according to another possible embodiment of the invention. [0118] [0119] Figure 9 shows a schematic of the component used to obtain row and column pointers in the sub-block intercon_si, decoding the local address based on the position and the given indices, according to a possible implementation of the invention. 
[0120] Figure 10 shows three possible clock domains used by the IP core according to embodiments of the invention. [0121] [0122] Figure 11 illustrates an example of combining multiple instances (stack) of a neural coprocessor according to embodiments of the invention, to reduce latency in applications with sufficient logical resources. It shows how multiple copies of the IP core can be instantiated to compute multiple layers at the same time. [0123] Figure 12 illustrates two modified DSPs, according to possible embodiments of the invention. [0124] [0125] Figure 13 shows possible implementations of module 15 of Figure 2, based on DSPs with the corresponding saturation modules, since the arithmetic is fixed-point. Two variants are shown for managing the outputs: by means of shift registers and by means of multiplexers. [0126] [0127] Figure 14 schematically represents different interconnection variants to adapt the input bandwidth to the requirements of the application. [0128] [0129] Figure 15 represents the execution with a wave-like pattern and its subsequent serialization at the output, which is a relevant feature of the operation of the IP core of the invention. It is a chronogram, given as an example, of the execution of the IP core, in which two design characteristics can be observed: i) the results are generated in a wave-like pattern; and ii) the automatic derivation of most of the internal parameters of the architecture minimizes the impact of bottlenecks like the one shown in this example. [0130] [0131] Figure 16 illustrates an example of combining multiple instances of module 15 of Figure 2 to reduce latency. [0132] [0133] DESCRIPTION OF A MODE FOR CARRYING OUT THE INVENTION [0134] [0135] Figure 1 represents a conventional artificial neural network (ANN), a computational model whose processing requires a large computing capacity that can only be met using specific processors of high performance and high efficiency. The processor or IP core of the present disclosure is specially designed to compute complex computational models such as, but not limited to, the one represented in Figure 1. The upper part of Figure 1 shows the typical architecture of an artificial neural network of the Single Hidden Layer Feedforward Network (SLFN) type. It is a 'shallow' network, which means that there are not many hidden layers (in this case there is only one, referenced in the illustrated architecture as "hidden layer"), in contrast to deep neural networks (DNNs), which have many hidden layers. In any case, with the design of the present disclosure, networks with any number of layers can be implemented. Note that each of the connections between the input layer and the hidden layer has a value associated with it, represented as w_h,f, while each of the connections between the hidden layer and the output layer has a value associated with it, represented as g_c,h. These values, called weights or gains, represent a multiplication and are the main parameters of the network. The network is also defined through the number of nodes in each layer. 
In the illustrated example, the input layer has 4 + 1 nodes (4 input nodes plus the bias node; the bias value is really a hidden-layer parameter), the hidden layer has 6 nodes and the output layer has 3 nodes. An expert will understand that the number of nodes per layer can vary from one network to another. In addition to the weights and the number of nodes in each of the layers, an SLFN is defined by the type of activation functions used in the artificial neurons, described below. In the context of the present disclosure, an SLFN is composed of two layers, since no arithmetic operation is performed in the one indicated as the input layer. [0136] [0137] In the hidden layer, each node is an artificial neuron. An artificial neuron is formed by the sum of all its inputs and the application of a non-linear transformation to the result, called the 'activation function'. There are multiple mathematical functions that can be used as an activation function for the activation computation. By way of example, but in a non-limiting way, we can cite the logistic sigmoid (sig), the hyperbolic tangent (tanh), radial basis functions (RBF) and the rectified linear unit (ReLU), among others. Note that throughout this text reference is made to the activation function as 'sigmoid', but this should not be understood as a loss of generality, since any suitable mathematical function can be used to compute the activation. In fact, the design contemplates the synthesis of the mentioned functions. In some activation functions it is advisable to use one or several parameters, called 'bias', to heterogenize the responses of the nodes, so that the implicit spatial projection is richer in details. This feature is illustrated in the figure as an additional node in the previous layer with a constant value. In the illustrated example, the output layer does not use bias, so it is not illustrated in the hidden layer. In the IP core of the present disclosure, the use of the bias can be activated for each layer by modifying a bit corresponding to the indicated constant. In the output layer, each node is a sum of all its inputs, with no activation function applied. Therefore, each node of the output layer can be interpreted as an artificial neuron with a null transformation at the output. Thus, in the same way as in the previous layer, the activation function of the output layer can be selected from a set of functions. In classification applications, it is usual to add an additional layer of a single node after the output layer to identify the output with the maximum value. The IP core of the present disclosure contemplates this possibility. However, it is not part of the reference SLFN model (Figure 1). [0138] [0139] The lower part of Figure 1 represents, from the algorithmic point of view, the neural network model of the upper part of the figure. This model is represented as two successive matrix products, with a non-linear transformation (the one defined by the activation function) applied to the intermediate result. Specifically: 'I' is a matrix composed of rows corresponding to the 'v' input vectors. 
The length 'f' of these 'v' vectors is equal to the number of nodes in the input layer, not counting the biases; 'W' is the input weight matrix, where each row h corresponds to the parameters of each node in the hidden layer; 'H' is the matrix product I·W; 'T' is the result of applying the non-linear function to each element of 'H'; 'G' is the output weight matrix, where each row c corresponds to the parameters of each node in the output layer; and 'R' is the matrix product T·G^T. Therefore, from here on we call a matrix product plus an optional non-linear transformation a 'layer'. In one version of the IP core of the present disclosure, the core is focused on a single layer at a time, so the computation of the inference phase of an SLFN involves two successive executions of the core. However, multiple copies of the IP core can be instantiated to compute multiple layers at the same time, as illustrated in Figure 11. Both solutions allow extrapolating their use to networks with any number of layers, since the hardware design is agnostic to the number of layers; that is, independent of previous or subsequent operations. Additionally, the figure shows the operation G^T = T^+ · B. This operation represents a training phase in the case of using a specific training method called Extreme Learning Machine (ELM). In the description of the IP core of the present disclosure, this step is not described in detail, since the IP core implementation described here focuses on the inference or feedforward stage (not on the previous training phase). However, it is illustrated to show that the first matrix product and the non-linear transformation are operations shared by both phases. Therefore, the proposed design can be used in conjunction with a linear solver to accelerate the training stage. In addition, the runtime programmability allows the same architecture to be used in both phases, training and inference, the synthesis and implementation of two versions of different sizes not being necessary. Alternatively, the Random Vector Functional-Link (RVFL) method, very similar to the one illustrated, can be used. The difference is that the matrix 'I' is appended to the right of 'T', so that each input node connects directly to each output node with a certain weight, in addition to the hidden layer itself. This modification is maintained in the training and inference phases. The implementation of the proposed IP core does not differentiate ELM from RVFL from the point of view of the hardware architecture, since it has been designed to support both models by changing only the memory spaces used. Also, the user can perform additional transformations between stages to support other network models. [0140] [0141] In sum, any neural network that can be expressed as a sequence of matrix products with optional intermediate non-linear transformations (i.e., any feedforward neural network model) can be 'mapped' to the IP core design of the present disclosure, which is described below. Once the layered network model is represented, the selection of parameters of the proposed IP core is based on the choice of the maximum values for each of the three dimensions involved in the products: the number of rows of the two matrices and the number of columns of both (which must be the same). For example, when an SLFN is processed, these pairs of matrices are 'I, W' and 'T, G', and the parameters to be defined are: v, max(f, h) and max(h, c). 
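Using the notation just introduced, the reference layer model of Figure 1 can be summarized as follows (a restatement of the equations described above, with f(·) denoting the element-wise activation function):

```latex
% Inference (feedforward) phase of the SLFN:
H = I \cdot W, \qquad T = f(H), \qquad R = T \cdot G^{T}
% ELM training phase (shares the first product and the activation),
% where B is the target matrix and T^{+} the pseudoinverse of T:
G^{T} = T^{+} \cdot B
```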
Additionally, the activation function(s) to be synthesized must be chosen. [0142] [0143] Next, it is described how the design and implementation of the IP core of the present disclosure adapt automatically to compute layers of a neural network of different sizes with a fixed number (probably lower than the number of neurons of the different layers of the model) of DSP blocks (hereinafter simply DSPs); or, in other words, how the conventional schema of a neural network, probably multilayer, is mapped onto the proposed IP core (hardware) implementation. The model of the network is divided into layers, each of these being a matrix product with a non-linear activation function optionally applied to the result. The layers of the neural network can be implemented as multiply-accumulate operations, that is, computing the product of two inputs and adding that product to an accumulator. The hardware unit that performs this operation is a MACC unit (Multiplier-ACCumulator). In an FPGA or ASIC, each DSP performs a vector product. Therefore, each DSP is equivalent to the first computation phase of a 'physical neuron' (the multiply-accumulate or MACC operation). To this we must add a non-linear transformation (usually sigmoidal, although this is also configurable), which is performed in a module located further down the datapath; this non-linear transformation is called the activation function of the neuron. The design is conceived so that the non-linear transformation modules are not a limitation for the throughput that the DSPs can demand. On the one hand, the design is segmented (pipelined) to accept one data item per clock cycle. On the other hand, there are as many instances as data in parallel can be generated by the previous module (15, maccs, which is described in relation to Figure 2). [0144] [0145] The number of neurons of the model of each neural layer is divided by the number of physical neurons (or DSPs that are to be used in the final implementation) to obtain the number of necessary folds; this implies a partial serialization of the processing. An input vector is loaded and all the physical neurons are executed in a wave-like fashion. If necessary, the vector is re-read 'fold' times, using in each pass all the physical neurons with different parameters. Note that this is necessary when the number of DSPs is less than the number of neurons in the layer being processed. This is called layer folding. In addition, if there are more layers to process, then these DSPs are also reused to process the next layer (which may require a larger or smaller fold, or none at all). However, multiple copies of the IP core can be instantiated, in which case the DSPs may not be reused across different layers (though they are for folding). If the number of neurons of the model pending computation in a wave is less than the number of physical neurons, only the necessary ones are used. This is true even when 'fold' is zero, that is, when the layer model has a number of neurons smaller than the number of DSPs implemented/synthesized (maximum parallelism). [0146] [0147] For example, an SLFN with k inputs, a hidden layer with 3 hidden neurons and 1 output, which is to be implemented using only 2 DSPs, is processed in the following way: for the hidden layer, the first two neurons of the hidden layer are computed [fold 0], the remaining neuron of the hidden layer is computed [fold 1] and the non-linear activation function is applied to the three results. 
For the output layer, the output neuron is computed [fold 0] and the non-linear activation function is applied to the result. [0148] [0149] Note that most of the operations that affect a layer are performed in parallel (several folds and the activation function), even though their exposition in the form of a list expresses a sequence. Note also that the fold is independent of the number of nodes of the input layer of the network, that is, of the number of elements in each vector product. [0150] [0151] The automatic adaptation of the processor (IP core) to the neural network model is carried out in two phases: in synthesis, by the "folding" of the layers to reuse the hardware resources in an optimized way (and the consequent adaptation of the control and data flow), and at runtime (once the processor is implemented) by using a series of programmable configuration registers, which allow adjusting the size of the layer to be processed and the type of activation function by writing to said registers. [0152] Next, an IP core according to an implementation of the invention is described, which optimizes the operations of the neural network in terms of memory access time by taking advantage of the locality of the data (memory and internal registers), so that continuous accesses to memory modules of higher capacity but lower bandwidth (external memory) are avoided. By using multiple scratchpads, the local bandwidth is also improved. Energy optimization is also achieved, which is a compromise between the occupied area and the computation time required. The configurability of this design makes it possible to look for a desired relation between area and speed. Note that energy consumption is associated mainly with the operating frequency (dynamic consumption), but also with the occupied area (static consumption). Therefore, the energy consumption depends on the number of DSPs used in the synthesis: the more DSPs used, the more operations are performed in parallel, so it will not be necessary to work as fast as with a small number of DSPs. [0153] Figure 2 illustrates a block diagram of a neural processor (IP core) 1 according to a possible embodiment of the invention. The IP core 1 is a computational accelerator suitable, among others, for the processing of artificial neural networks (ANNs) of the feedforward type, in which the operations are performed with fixed-point arithmetic, specially optimized for applications where the data to be processed are transferred serially. The fact that all operations are performed in fixed point means that it is optimized for an efficient use of logical and arithmetic resources (in the case of an FPGA) or of the occupied silicon area (in the case of an ASIC), as well as for the latency of the calculations and, consequently, the energy consumption. The block diagram also represents a highly parameterized design, which is self-configured at synthesis time and is programmable at execution time. As explained below, this means that a designer who is going to use this IP core 1 must simply specify the characteristics of the model of the network and the target technology (available resources) at the time of synthesis, and the IP core 1 will be conveniently configured to adjust the former (characteristics of the network model) to the latter (target technology). The source code, developed in VHDL, is fully parameterized by means of 'generic', 'generate' and 'package' statements. 
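Before continuing with the parameterization, the layer folding and wave-like execution described above can be sketched behaviourally as follows. This is an illustrative Python model, not the hardware: it only reproduces the partitioning of the model neurons over a fixed number of DSPs (the folds) and the fact that, within a fold, all active DSPs consume the same serial data stream; function and variable names are hypothetical.

```python
import math

def run_layer(inputs, weights, n_dsp, activation=None):
    """Behavioural sketch of one folded layer.

    'weights' has one row per model neuron, but only 'n_dsp' MACC units exist,
    so the input vector is re-read ceil(neurons / n_dsp) times (the 'folds').
    In the hardware the DSPs of a fold start one after another in a wave, all
    reading the same serial stream; here only the arithmetic is modelled.
    """
    neurons = len(weights)
    folds = math.ceil(neurons / n_dsp)
    results = []
    for fold in range(folds):
        active = weights[fold * n_dsp:(fold + 1) * n_dsp]  # last fold may be partial
        accs = [0] * len(active)
        for x, wk_tuple in zip(inputs, zip(*active)):      # one input element per cycle
            for k, wk in enumerate(wk_tuple):
                accs[k] += x * wk                          # MACC in each active DSP
        results.extend(accs)
    return [activation(a) for a in results] if activation else results

# Example from the text: a hidden layer with 3 neurons mapped onto 2 DSPs
# -> fold 0 computes neurons 0 and 1, fold 1 computes the remaining neuron.
hidden_weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(run_layer([1, 1, 1], hidden_weights, n_dsp=2, activation=lambda s: max(0, s)))
```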
By modifying a small group of such parameters, the code automatically calculates the word size (number of bits) required in each signal and circuit register. Also, some modules are added or removed automatically, depending on the choice. For example, if only one activation function is synthesized, the logic for selecting it is not necessary and it is eliminated; or, in the case of using a dual precision set in the numerical representation, the DSP and saturation modules are adapted to manage both appropriately. In summary, the user has at his disposal a set of high-level parameters that define the model of the network to be implemented and certain aspects of the hardware to be used, and the code associated with the IP core and design procedure of the present invention automatically adjusts all aspects of the design of the processor 1 (internal or low-level) so that it is synthesized in an optimized manner fulfilling the requirements imposed by the user. Once the processor (IP core) 1 is implemented, runtime programming is achieved by replacing 'hard-wired' constants with registers and, at the same time, by providing a communication mechanism to read/write them. Specifically, one of the Wishbone input ports of the neural processor 1 accesses the configuration registers in the control module. The protocol used is the same as in the data ports, so that registers can be consulted/modified individually or as a block. When they are larger, the control registers are multiplexed to optimize the number of connections to the subcomponents of the processor 1. Thus, it is the task of the main state machine to update the registers in each subcomponent as soon as the trigger is received and before actually starting the computation. [0154] [0155] Before going into the details of each module or element of the IP core 1 of Figure 2, and in order to consider a possible context of use for which the IP core 1 has been designed, Figure 3 represents a block diagram of a possible realization of a complete design based on a So(P)C (System on (Programmable) Chip). Virtually all the modules/components illustrated in Figure 3 can be integrated into a single chip (FPGA, ASIC or other). The DDR memory (DDR3 in Figure 3) is likely to be external to the chip, although it is technically possible to include it in the chip. Note that the modules referenced as "AMBA Interconnect IP", "PCIe IP", "UART IP", "IP Timer", "IP DDR", "IP CPU", "DDR3" and "PC" do not form part of the present invention and therefore should not be considered as limiting; they are indicated as an example of the context in which the IP core of the present disclosure is expected to be used, but none of these modules is necessary for the use of said IP core. However, the IP core 1 is designed to work with a main CPU (referenced as "IP CPU" in Figure 3) and with a high-capacity external memory (by way of example, but not limitation, a DDR). Returning to Figure 3, the modules/components 3-9 represent different possible types of memory blocks that can be used: single-port ROM 3, single-port RAM 4, dual-port ROM 5, dual-port RAM 6, dual-port RAM with read/write and read ports 7, dual-port RAM with read/write and write ports 8, and true dual-port RAM 9. Each of these modules includes a submodule to interpret the Wishbone protocol used in the MAI (Matrix-Aware Interconnect). Figure 3 also shows some bridge modules/components 10, 11 that are detailed in Figure 4. 
Like the BRAMs, each bridge module 10, 11 includes master/slave submodules to interpret the Wishbone and AMBA protocols. In particular, the first bridge module 10 comprises a master sub-module for interpreting the Wishbone protocol (WB master in Figure 4), two FIFOs and a slave sub-module for interpreting the AMBA protocol (AXI slave in Figure 4). In turn, the second bridge module 11 comprises a slave sub-module for interpreting the Wishbone protocol (WB slave in Figure 4), two FIFOs and a master sub-module for interpreting the AMBA protocol (AXI master in Figure 4). The FIFOs are used to maximize throughput. These bridge modules 10, 11 can be an optional part of the MAI. What is illustrated in Figures 3 and 4 as "Matrix-Aware Interconnect" is its core. From the point of view of the source code, a hierarchically superior module can include the MAI and the bridge modules, in addition to some MMCM. The Wishbone and AMBA protocols are the means of communicating the neural processor with the CPU. Note that a processor that understands the Wishbone protocol does not require any bridges 10, 11 and could connect directly to the core of the MAI. Although in Figure 3 only one instance of each module/component 3-11 is shown schematically, it is possible to instantiate multiple copies of any of them, or to instantiate only one of them. Note that the modules and interfaces listed allow practically any additional peripheral or coprocessor to be mapped to the MAI. Thus, in Figure 3 the module/component 2 represents another processor or IP core, which could be included to complement the IP core 1, such as, but not limited to, a linear solver. [0156] Returning to Figure 2, the IP core 1 includes a module or element 15 in which the arithmetic operations are performed to compute the vector products that make up a matrix product. The module 15 comprises a series of DSP modules, output saturation modules and registers. Optionally it can include multiplexers and the counter associated with each of them. Each of these DSPs can be implemented with a multiplier and an adder, in addition to auxiliary resources such as logic gates, registers and multiplexers. These modules are processing blocks usually integrated in FPGA architectures. In the present design, saturation modules composed of a comparator and a multiplexer have been added. Figure 12 shows two possible embodiments of modified DSPs 33a, 33b. These modified DSP modules are also called DSPs by association. The module 15 itself, in addition to the DSPs, includes a shift register at the input. Finally, shift registers or multiplexers can be included at the output. These two variants are shown in Figure 13: on the left (reference 15a), by means of shift registers, and on the right (reference 15b), by means of multiplexers 35. Figure 16 illustrates an example of combining multiple instances of the module 15 of Figure 2 to reduce latency. [0157] [0158] Module 16 is the module in which the non-linear transformation, which is optional, is performed. Modules 12-14 implement the information management needed to read the input data. The function of these modules is to receive the data of the two matrices whose product must be calculated. Module 12 is the interconnection (shared bus or crossbar) and the controller associated with each FIFO; it is detailed in Figure 5. Module 13 represents a set of multiplexers; its main use is feedback. Module 14 is the FIFOs/scratchpads. 
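As a behavioural illustration (not the RTL) of the arithmetic performed in module 15, the following Python sketch computes a serial fixed-point vector product in a full-precision accumulator and then applies a saturation stage (the comparator plus multiplexer mentioned above). The word sizes and the Q-format used are arbitrary examples, not the values of any particular configuration.

```python
def saturate(value, out_bits):
    """Comparator + multiplexer: clamp to the signed range of 'out_bits' bits."""
    hi = (1 << (out_bits - 1)) - 1
    lo = -(1 << (out_bits - 1))
    return hi if value > hi else lo if value < lo else value

def macc_fixed_point(data, weights, frac_bits=8, out_bits=16):
    """Serial fixed-point vector product: one multiply-accumulate per 'cycle'
    in a wide accumulator, then rescale and saturate the result."""
    acc = 0
    for x, w in zip(data, weights):
        acc += x * w                            # full-precision accumulation (DSP)
    return saturate(acc >> frac_bits, out_bits)  # drop fractional bits, saturate

# Example with Q7.8 operands (integers holding value * 2**8).
to_fix = lambda v: int(round(v * 2 ** 8))
data = [to_fix(v) for v in (0.5, -1.25, 2.0)]
weights = [to_fix(v) for v in (1.0, 0.5, -0.75)]
print(macc_fixed_point(data, weights) / 2 ** 8)  # -1.625
```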
[0159] [0160] Figure 5 illustrates a possible implementation of the input modules 12-14 according to the present invention. The module 12 is formed by a first block or submodule 32, which is a shared-bus or crossbar-switch type interconnection, at the choice of the integrator or designer, and by a set of control modules 331-335. These controllers are transfer controllers: they adapt the Wishbone protocol to the FIFO interfaces and can therefore be regarded as 'memory controllers'. All of them 331-335 are architecturally identical. In figure 5, the number of ctrl blocks 331-335 and their associated FIFOs is not necessarily 5, since this depends on the implementation. In general, the number of blocks is #DSP + 1 (number of DSPs used plus one). This same consideration applies to the modules marked 142-145 (note that 141 is different from the rest). With respect to module 12, figure 14 schematically represents different interconnection variants 12a, 12b, 12c, 12d to adapt the input bandwidth to the requirements of the application. Block 14 is formed by a set of modules 141-145 which represent a set of slightly modified circular FIFO memories: there is an additional register compared to conventional implementations, and this register replaces the read pointer in the generation of the 'empty' signal. These memories, together with the controller f_ctrl 331-335 of module 12 associated with each of them, act as temporary storage (scratchpads) for the data that make up the matrices to be processed, so that both the spatial and the temporal locality of said data are exploited. A scratchpad is a type of temporary memory, similar to a cache but with much simpler control logic. Even so, it implies a certain amount of address-management logic. Typically the managed addresses belong to the memory space of a microprocessor. In the case of the present processor, with the MAI interposed, the memory space is not that of any microprocessor, but has the size necessary to differentiate the specific volume of data to be managed. In short, it can be regarded as a set of shared registers, without further information about their origin or destination. The module 13 is, from a functional point of view, a multiplexer. Its function is to give way to the feedback 30 (figure 2); it is the module that allows data to be transferred directly from block 18 to block 14 (figure 2). The pattern of access to the FIFO memories 141-145 is illustrated below in relation to figure 6. [0161] [0162] Modules 17-18 (figure 2) implement the information management needed to save the results. The module 17 unifies the output channels of the module 16 into one or more streams. It does so in such a way that the result is ordered as if the processing had been carried out serially with a single DSP. Module 18 is an interconnection analogous to 12 (32), that is, a shared bus or crossbar switch, with one or more transfer controllers. As already mentioned, the feedback data leaves module 18, enters module 13 and goes from there to its destination (module 14). Note that later, in figure 7, the element 180 is included in the module 17 (figure 2). In turn, the element 190 in figure 7 corresponds to the block 18 in figure 2. The module 19 is the control/orchestration module. In figure 7, below, it is represented as the control module 185. It contains configuration registers for programming the behavior of the IP core at runtime. Next, the connections between modules are explained: what information is transmitted and what protocol can be used in each connection.
The connection 20 represents the ports through which the content of the two matrices used to compute the product is read from the peripherals (preferably memories) connected to the MAI and/or to the AMBA Interconnect (see figure 3). In a possible embodiment, these ports 20 are Wishbone B4 read-only masters. Preferably, the Wishbone B4 read-only masters implement the virtual addressing format of the MAI, through which the content of said matrices is read. The number of ports 20 is defined by the user. [0163] [0164] Between the modules 12 and 13 (the module 12 is equivalent to the swi module 120 of figure 7 and the module 13 corresponds to a generalization of the module 130 of figure 7), two FIFO (First In First Out) write interfaces 21, 22 are established. The interface 21 corresponds to the input matrix and the interface 22 corresponds to the weight matrix. The number of ports is automatically derived from the parameters defined by the user. Between the modules 13 and 14, two FIFO write interfaces 23, 24 are established. As in the case of interfaces 21, 22, interface 23 corresponds to the input matrix and interface 24 corresponds to the weight matrix. As for the interfaces 25, 26 between the modules 14, 15, they are slightly modified FIFO interfaces that, as in the previous cases, correspond respectively to the input matrix and the weight matrix. The aforementioned modifications are detailed in figure 6. The number of ports is automatically derived from the parameters defined by the user. Between the module 15, in which the arithmetic operations are performed to compute the vector products that make up a matrix product, and the module 16, in which the non-linear transformation is optionally performed, FIFO write interfaces 27 are established, through which the matrix-product output of the DSPs of the module 15 is transferred to the optional activation functions of the module 16. The number of ports is automatically derived from the parameters defined by the user. Between the module 16, in which the non-linear transformation is optionally performed, and the module 17, FIFO write interfaces 28 are established. In implementations of the invention, the activation-function module 16 can be designed to be transparent with respect to the interface 28. The data transmitted on this interface 28 is the final result of the layer (either the hidden layer, which is computed first, or the output layer, which is computed later). The number of ports is automatically derived from the parameters defined by the user. Between the module 17 and the module 18, FIFO write interfaces 29 are established. In implementations of the invention, the module 17 can be designed to be transparent with respect to the interface 29. The data is transmitted to the module 17 through the interface 28; this data can optionally be serialized and/or reordered. The number of ports is defined by the user. Between module 18 and module 13, read/write FIFO interfaces 30 are established, which provide feedback to the IP core 1. The data is transmitted to the module 18 through the interface 29. The number of ports is defined by the user. [0165] [0166] The connection 31 represents the ports through which the result of the computation executed in the IP core 1 is written to, for example, peripheral modules. The peripherals to which the result is written can be, for example, memories. These peripherals are usually connected to the MAI and/or to the AMBA Interconnect (see figure 3).
In a possible embodiment, these ports 31 are Wishbone B4 write-only masters. Preferably, the write-only Wishbone B4 masters implement the virtual addressing format of the MAI. The number of ports is defined by the user. On the other hand, the dashed lines between the control module 19 and the modules 12-18 represent ad-hoc connections for the distribution of execution parameters from said module 19. The type of interface that implements these connections is preferably addressable memory/registers. Since its depth is preferably between 2 and 6 addresses, the impact of the addressing is negligible. Finally, the bidirectional arrow to the left of module 19 is a write/read port used to program the coprocessor (IP core 1) at run time. Through it, the registers of the module/component 19 are modified, and they are then interpreted and distributed automatically to the rest of the components. In a possible embodiment, this port is a Wishbone B4 slave. In other words, the runtime programming capability is achieved by providing a write/read port that allows interaction with the module 19. This module includes a series of registers and several basic state machines, such as state machines based on two or three status bits, which make it possible to automate the modification of the behavior of the architecture. [0167] [0168] Figure 6 illustrates the access pattern to the FIFO memories 141-145 shown in figure 5, according to a possible implementation of the invention. Specifically, the access pattern to the FIFO memories 141-145 has been represented for an example that corresponds to the multiplication of a three-row matrix by another of four rows (independently of the number of columns, which must be equal), taking into account the possibility of applying the folding of a layer of a neural network model onto a given number of DSPs. The reference 14(i) refers to the first FIFO memory 141 of figure 5, while the reference 14(b) refers to the rest of the FIFO memories 142-145 of figure 5. As can be seen, the memory 141 (14(i) in figure 6) shows a greater temporal locality of the data, reading the same vector several times consecutively. In the case of the memories 142-145 (14(b) in figure 6), there is an equivalent spatial locality (in both cases the reading is vectorial), but a behavior less favorable to temporal reuse is observed. Note that, if unmodified circular FIFOs were used, each vector could only be read once before being overwritten. Therefore, the modifications made are based on the duplication of the read register: in memory 141 (also called 14(i)), there are two additional single-bit signals with respect to a conventional FIFO memory. The first of them saves the value of the read pointer in the backup register, which is the one used to generate the full signal. The second of these signals allows the effective read pointer to be returned to the registered value. In memories 142-145 (also called 14(b)), there is a single additional signal. Since the reading of the vectors corresponding to multiple consecutive folds is sequential, only the signal that allows returning to the beginning is available.
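As a hedged behavioural model of the modified circular FIFO just described, the following sketch adds a backup register to a conventional FIFO so that the read pointer can be restored on every pass of a folded layer; the 'full' flag is derived from the backup register, as indicated above, so the protected vector cannot be overwritten. The class and method names (rewind/commit) are illustrative; they correspond conceptually to the FB/FW controls of the scratchpad interface mentioned later.

class RewindableFifo:
    def __init__(self, depth):
        self.mem = [None] * depth
        self.depth = depth
        self.wr = 0       # write pointer
        self.rd = 0       # effective read pointer
        self.saved = 0    # backup register: first position of the vector in use

    def full(self):
        # derived from the backup register, so the protected vector
        # (between `saved` and `wr`) can never be overwritten
        return (self.wr - self.saved) == self.depth

    def empty(self):
        # derived from the effective read pointer, as in a plain FIFO
        return self.rd == self.wr

    def push(self, value):
        assert not self.full()
        self.mem[self.wr % self.depth] = value
        self.wr += 1

    def pop(self):
        assert not self.empty()
        value = self.mem[self.rd % self.depth]
        self.rd += 1
        return value

    def rewind(self):
        # return the read pointer to the protected position (re-read the vector)
        self.rd = self.saved

    def commit(self):
        # the vector's utility has expired: move the protection point forward
        self.saved = self.rd

# Re-reading one 3-element vector once per pass of a folded layer (4 passes):
fifo = RewindableFifo(depth=8)
for value in (1, 2, 3):
    fifo.push(value)
for p in range(4):
    print([fifo.pop() for _ in range(3)])
    if p < 3:
        fifo.rewind()   # not the last pass: the same vector will be read again
fifo.commit()           # last pass done: the space may now be reused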
Additionally, in applications that require processing products with different matrices (as is the case, for example, when private kernels are used instead of public kernels in deep networks), the memories 142-145 (also called 14(b)) can be implemented with the same modifications as the memory 141 (also called 14(i)), so that all the available space is used, loading new kernels as soon as the previous ones are no longer needed. [0169] [0170] Figure 7 illustrates a possible example of implementation of an IP core 100 according to the invention. This example corresponds to an implementation with a certain parameterization; note that there are many alternative possibilities for synthesis. The input data, transferred through port 101 (20 in figure 2), are the vectors that make up the two matrices whose product is to be computed. The data corresponding to one of the matrices is transferred only to the FIFO 'i' (140i in figure 7, 141 in figure 5) of block 140 (block 14 in figures 2 and 5), while the data corresponding to the other matrix is distributed among the FIFOs 'b0 ... b9' (140b0 ... 140b9 in figure 7) of the block 140. The sub-module 320 (32 in figure 5) of the block 120 (block 12 in figures 2 and 5) can be a shared bus or a crossbar switch, as configured. In the case of a crossbar switch, more than one input port is available, so several instances of the data entry ports 101 are implemented. The interconnection arbiter, which is implemented in the intercon_si block (320 in figure 7, 32 in figure 5), acts as a double master and converts the indices provided by modules 3301-3311 of block 120 (it converts the Wishbone interface into the FIFO interface). Note that acting as a double master means that both the MAI 200 and the modules 3301-3311 are slaves, so it is the arbiter that initiates and finishes each transfer. Modules 3301-3311 are protocol converters, in this example from Wishbone slave to FIFO interface. The address is composed of two indices, which are chained counters. The count limit of each of them depends on the value of the control registers at run time. [0171] [0172] The IP core 100 has a fixed-point matrix-vector multiplier block (maccs block) 150, which implements the MACC (Multiply ACCumulate) operation based on a linear systolic array with parallel parameter loading and wave-like execution. The maccs 150 block is optimized for the multiplication of a matrix by multiple vectors received serially and continuously (as a stream), in practice performing a matrix product. The choice of its input and output stages, as well as the variety in the control logic, makes it possible to configure this IP core 100 to process different models of feedforward-type ANNs such as, but not limited to, "Extreme Learning Machines", "Random Vector Functional-Link (RVFL)" networks and, in general, any multilayer ANN, including the "Deep Neural Network (DNN)" or the "Convolutional Neural Network (CNN)". [0173] [0174] Since most of today's mass storage devices (memories) with sufficient capacity to store the volume of data required in the target applications, and with access times that allow real-time execution, have serial interfaces, this feature has been imposed as a design condition in the design of the illustrated IP core 100. This implies that the circuit must address the need to distribute the data received serially through the interface 101 (or interfaces, if the module 320 is a crossbar) to multiple arithmetic modules working in parallel.
Due to the requirement to parallelize data that is received serially, it is not expected to have more than one element of a vector available at the same time. Thus, serial reception refers to each data item (which will consist of multiple bits), not to each bit. Likewise, the results must be collected and serialized in order to send them back to main memory through the output interface 102 (or interfaces, if sub-module 171 of block 170 is a crossbar). [0175] [0176] In embodiments of the invention, in order to take advantage of data locality, BRAM blocks 140i, 140b1-140b9 are used in block 140 in the form of FIFOs, as indicated by the name of the fifos 140 block. The usual strategy is to wait for the reception of a number of data items equal to the number of modules working in parallel and to start the computation (block 150) in all of them at the same time. The number of modules working in parallel is given by the number of DSPs used in block 150. In addition, the number of outputs of module 320 is always one more than the number of DSPs, since the additional output of module 320 is the input to the shift register R1-R9 (which, in turn, feeds each DSP). Therefore, if the circuit latency is less than the time required to load each group of data, there are periods of inefficiency in which the arithmetic modules are not used. In any case, there is always an initial latency. On the other hand, at the end, all the data becomes available at the same time, so in a possible realization serializing it requires stepping through each of the outputs. In an alternative embodiment, the output registers can be linked as a shift register, so that no multiplexer is required. These two alternatives, referenced as 15a and 15b, are illustrated in figure 13. In the particular example of figure 7, the maccs 150 block has been divided into three groups 150a, 150b, 150c. The first two, 150a and 150b, have four DSPs each (DSP0-DSP3 in sub-block 150a and DSP4-DSP7 in sub-block 150b), while the third, 150c, has two DSPs (DSP8-DSP9). The number of groups is automatically calculated at synthesis time from the parameters provided by the user. Each sub-block has a multiplexer MUXa, MUXb, MUXc to serialize the data of the respective sub-block 150a, 150b, 150c. The maximum number of DSPs in each group is equal to the minimum expected length of the input vector, which is a parameter indicated prior to synthesis. In the example of figure 7, the value of said parameter is 4. Whenever the remainder of the division between the number of DSPs and this parameter is different from zero, the last group or sub-block will have a smaller size; in this case, 10 rem 4 = 2. Figure 8 illustrates a block diagram of another example of IP core 100' according to another possible embodiment of the invention, in which the output registers are linked as a shift register, so that no multiplexers are needed. [0177] [0178] A novelty that can be highlighted in this IP core 1, 100, 100' and its design procedure is that an inverse analysis of the sequencing is performed; that is, the adjustment of the throughput and latency requirements in each of the modules is made from the maximum write capacity. In other words, the architecture automatically adjusts to the constraints imposed in its parameterization.
The main effect is the automatic derivation of the number of 'sigmoid_wp' modules of the activation function block 160 (block 16 in figure 2) from the number of outputs of module 150 (block 15 in figure 2), which depends on a synthesis parameter, as explained. Specifically, the MACC operation performed in the DSP0-DSP9 modules of block 150 compresses z elements into one, so a single port that writes data serially, at one data item per cycle, can manage the outputs of the arithmetic modules (each DSP and its saturation block). Note that z is a parameter programmable at runtime and corresponds to the length of the vectors with which the product is computed. Referring to figure 1, the values that z can take are z = f, z = f + 1, z = h or z = h + 1, depending on whether the hidden layer or the output layer of the network is being computed and whether or not bias is being used. As explained, the vector product is computed sequentially (hence this parameter z is not reflected in figure 7). In the architecture, z is implemented by a register located in the control module 185 (module 19 in figure 2). Since the output of one of the arithmetic modules must be read in each clock cycle, it is reasonable to consider it an inefficient strategy to generate results in several of them at the same time. In a preferred embodiment, for efficiency, each module (each DSP and its saturation block) generates its valid result in the cycle immediately following the previous module, unlike what happens, for example, in the design of DianNao. Since all of them have the same latency, for the results to be generated with a one-cycle difference the computations must start in that same staggered order. The time evolution of the control signals fulfilling the above requirements exhibits a wave-type pattern, such as the one illustrated by way of example in figure 15, characteristic of shift registers. In the proposed design, this pattern is used to connect a single data entry port 140out (output of element 140i in the fifos 140 block) directly only to the first DSP0 of the maccs 150 block, and to the rest of the DSPs DSP1-DSP9 through a series of chained registers R1-R9. Thus, the routing complexity and the required fanout are reduced. Also, reducing the length of the tracks allows operation at higher clock frequencies. In addition, the computation in the first DSP (DSP0) starts as soon as the first data item is available, so the initial latency is independent of the number of arithmetic modules used in parallel; it is reduced to the length of the vector. [0179] [0180] The arithmetic modules operating in parallel generate up to #DSP data items (as many data items as there are DSPs) every z cycles, where z is the length of the vector. Therefore, if #DSP > z (if the number of DSPs is greater than z), a single output cannot collect all the generated results. For this reason, one design parameter is the minimum length of the vectors to be processed. Based on it, ceil(#DSP / g_z) output ports are generated at synthesis time, where g_z is the parameter that indicates the minimum value that z can take and ceil is an up-rounding operation, and a multiplexing module (MUXa, MUXb, MUXc in figure 7) is added between each group of g_z DSPs and the corresponding port. The maximum size of the groups 150a, 150b, 150c is conditioned by g_z. In the example shown, each multiplexing module is implemented by a multiplexer and a small state machine based on a counter.
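As a brief, illustrative sketch of the grouping rule just described (ceil(#DSP / g_z) output ports, with the last group taking the remainder of the division), the following function, whose name is merely illustrative, reproduces the partition used in the example of figure 7.

import math

def dsp_groups(n_dsp, g_z):
    # One output port (and one multiplexing module) per group of at most g_z DSPs
    n_groups = math.ceil(n_dsp / g_z)
    sizes = [g_z] * (n_groups - 1)
    sizes.append(n_dsp - g_z * (n_groups - 1))   # remainder: the last, smaller group
    return sizes

# The configuration of figure 7: 10 DSPs and a minimum vector length g_z = 4
print(dsp_groups(10, 4))   # -> [4, 4, 2], i.e. sub-blocks 150a, 150b and 150c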
The internal synchronization signals are adjusted automatically, also considering the cases in which the remainder of the division is not zero, so instances of different sizes must be managed, as represented by sub-block 150c in figure 7. There are two possible implementations of these multiplexing blocks. As mentioned, in possible implementations of the invention, such as the one illustrated in figure 7, the multiplexing block MUXa, MUXb, MUXc is implemented by a set of multiplexers and a counter. This implementation does not imply any latency but, depending on the size of the chosen group, the logic requirements can be considerable. In other implementations of the invention, such as the one illustrated in figure 8, the multiplexing block is implemented by means of a shift register R'0-R'9 with independent loading per register, together with the maximum-operator counter in the next block, sigmoid_wp, of the activation function block 160. A characteristic of this design is that the control system is very simple (it can be regarded as a state machine with only four states) and requires practically no programming. The design has been carried out in such a way that, when the architecture to be implemented is generated, the elements (counters) necessary for generating the control signals that manage the data flow are automatically created. In this case, a simple counter acts as a state machine: as soon as the data of the whole group is available, the output of the shift register is read sequentially, as many times as the group has outputs. When the last one arrives, a pulse is generated that 'passes the relay' to the counter of the next group. In this implementation it is necessary to wait until the whole group has been written before starting to read. The maximum group size is g_z / 2, instead of g_z. The lines are much shorter and the routing requirements are simplified.
Therefore, this is a compromise situation in which a few more registers have been used in the design, in anticipation of configurations that require a high number of arithmetic modules operating in parallel. [0183] [0184] The execution of the module 150 is performed in lockstep, that is, all the DSPs work at the same rhythm as long as i) none of the memories that must be read in a given cycle is empty, and ii) none of the outputs that must be written in a given cycle is full. For this, the "empty" signals of the memories of the fifos module 140 are chained with "or" gates, in such a way that the routing complexity is reduced. The "full" signals of the memories of the smerger_wp 170 module are evaluated at the same time, since, if the synthesis parameters are chosen appropriately, they are unlikely to be full, and their number is very small compared to the number of DSPs. The module smerger_wp 170 is formed by a group of simple FIFOs 170a, 170b, 170c and by a serializer (smerger) 171. This means that the datapath is a multiplexer, with two chained counters to produce an output order equivalent to that produced by a sequential processor. Given the lockstep execution, the control logic is the same regardless of whether a 2D or 3D layout is used. [0185] Figure 7 also shows an interconnection core 200, also referred to as the MAI block (Matrix Aware Interconnect), which has been designed in parallel with the IP core 100 to facilitate its integration in a heterogeneous system with one or several coprocessor modules. The interconnection core 200 allows the accelerator (IP core 100) to be used jointly with other modules to accelerate also the training phase of some of the methodologies, and not only the processing. The interconnection core 200 (MAI block 200) is responsible for the management of internal and external memory blocks. The interconnection core 200 is detailed below. [0186] [0187] The design of the IP core 1, 100, 100' is completely parameterized, which means that it is configurable before synthesis. In a possible implementation, in which the processing module 10 is designed to synthesize a network with a single hidden layer (SLFN) using a single instance of the maccs 15, 150 block, two sets of parameters are defined: a set of parameters that represent the size of the synthesized network and a set of generic parameters of the module. [0188] [0189] The parameters that represent the size of the synthesized network are: [0190] [0191] x_m, number of MACCs. This is the number of DSPs in each block 15, 150. Note that the number of sub-blocks 150a, 150b, 150c is given by ceil(x_m / g_z); [0192] [0193] x_s, number of sigmoids. This is the number of sub-blocks 160a, 160b, 160c within the activation function block 160; [0194] [0195] x_v, max vectors. This is the maximum number of vectors expected to be processed in one run, which are the ones that would be loaded from an external element into the internal memory for processing as a stream. Specifically, it defines the end-of-count value of one of the counters of the f_ctrl sub-module of block 120 corresponding to the FIFO 'i' 140i of the fifos module 140. The module 150 is agnostic as to the total number of vectors to be processed. [0196] [0197] x_f, max features (inputs). This is the maximum number of elements expected in each vector of a vector product, that is, the maximum dimension of the input vectors that the neural network has to process (which is the number of nodes of the input layer of the network). [0198] [0199] x_h, max hidden neurons.
This is the maximum number of vector products that will be computed with the same input vector. In the case of an SLFN, it is the number of neurons in the hidden layer. [0200] [0201] x_c, max classes | regressions (outputs). This is, when the parameterization file for SLFNs is used, the number of nodes or neurons in the output layer. Note that normally x_c << x_h, so this parameter is a safeguard. [0202] [0203] g_z, min features (inputs). This is the minimum number of elements that each vector of a vector product is expected to have. The relationship between this parameter and x_m defines how many groups block 150 will be divided into. [0204] [0205] The generic parameters of the module are: [0206] [0207] g_mlt, MACC latency. It refers to the latency of an arithmetic module, that is, a DSP and the saturation module immediately after it. It must be modified if another description is used for them. In principle, it is not expected to be modified in most applications and is fixed. [0208] [0209] g_mp, MACC accumulator width, 48 | 64. The MACC operation involves the accumulation of successive products, usually in a register. The correct choice of its size depends on: i) not incurring overflow when operating in fixed point, and ii) the inference of 'hard' DSP modules on the platforms that have them (for example, FPGAs). This parameter defines the size of that register. [0210] [0211] g_slt, sigmoid latency. It refers to the latency of each sigmoid_wp module (160a, 160b, 160c) of the activation function block 160. [0212] [0213] g_swi_b, switch i max block length. Transfers through the input interface 101 are made in blocks or 'bursts'. Each of the slaves involved in the communication can pause it. Additionally, to prevent a single communication from absorbing all the bandwidth, the arbiter can decide when to end it in order to attend to other requests. This parameter sets the maximum number of elements to be transmitted in each block. Note that the lower the value of g_swi_b, the more protocol 'overhead' there will be, so the throughput will be reduced. Higher values of g_swi_b will improve throughput at the expense of introducing slightly higher latencies. However, the recommended procedure, when resources and requirements permit, is to load all the FIFOs before starting the computation, so that all the bandwidth is available for FIFO 'i'. [0214] [0215] g_swi_a, switch i max stb to ack offset (power). The Wishbone protocol used sends 'STB' and expects to receive 'ACK' in response. Due to the elements involved in the communication, the response is not immediate and it is possible that several STBs are sent before the first ACK is received. This parameter defines the size of the counter that tracks the difference between the number of STBs sent and the number of ACKs received. [0216] [0217] g_swo_b, switch o max block length. The same as g_swi_b, but applied to the output interface 102. [0218] g_swo_a, switch o max stb to ack offset (power). The same as g_swi_a, but applied to the output interface 102.
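As a hedged summary of the parameter list above, the following sketch groups the synthesis parameters into a single configuration object and derives two quantities discussed in the text: the number of output ports (DSP groups) and the number of passes per input vector when a layer is folded. The default values and the class name are merely illustrative and do not correspond to any particular configuration of the design.

import math
from dataclasses import dataclass

@dataclass
class CoreParams:
    x_m: int = 10        # number of MACCs (DSPs)
    x_s: int = 3         # number of sigmoid_wp instances
    x_v: int = 1024      # max vectors processed in one run
    x_f: int = 256       # max features (input-layer size)
    x_h: int = 128       # max hidden neurons
    x_c: int = 10        # max classes / regressions (output-layer size)
    g_z: int = 4         # min vector length
    g_mlt: int = 4       # MACC latency (DSP + saturation module)
    g_mp: int = 48       # MACC accumulator width (48 or 64)
    g_slt: int = 4       # sigmoid latency
    g_swi_b: int = 64    # input switch: max block (burst) length
    g_swi_a: int = 8     # input switch: max STB-to-ACK offset (power)
    g_swo_b: int = 64    # output switch: max block length
    g_swo_a: int = 8     # output switch: max STB-to-ACK offset (power)

    def output_ports(self):
        # ceil(#DSP / g_z) output ports, one per group of DSPs
        return math.ceil(self.x_m / self.g_z)

    def passes(self, h):
        # waves per input vector when a layer of h neurons is folded onto
        # x_m DSPs; folding = passes - 1
        return math.ceil(h / self.x_m)

p = CoreParams()
print(p.output_ports())   # 3 groups / output ports for 10 DSPs and g_z = 4
print(p.passes(12))       # a 12-neuron layer on 10 DSPs needs 2 passes (1 fold)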
[0219] [0220] From the above parameters and the choice made among the variety of architectures offered in some of the modules (such as whether the module 320 is a shared bus or a crossbar; whether multiplexers or shift registers are used in the modules MUXa, MUXb, MUXc; whether a single activation function or several are used, optionally together with the maximum operator and the bypass; or the optional feedback of the output of the element 180 to the module 130), the design optimizes at synthesis time the size of all the registers, signals and memory elements of the circuit, using the minimum resources necessary to enable the execution of problems with the given sizes. The following is a description of the variety of architectures/functionalities contemplated in the design: [0221] [0222] -Bias of the activation functions of each neuron. ANNs contain a set of adjustable parameters whose values are optimized by applying some training/learning algorithm. These parameters are the so-called "bias" of the nodes or neurons and the so-called "weights" of the interconnections between neurons. The "bias" of the activation functions of each neuron is introduced as the preload value in the DSP; each neuron requires a DSP (as has been said, not necessarily dedicated exclusively to it) to perform the sum-of-products (MACC) operation necessary for the computation of the neural function. After this operation, the output is optionally passed through the activation function. The bias is received through the same port as the weights; as a result, its implementation only requires a single-bit enable signal, carefully timed. This signal is part of the three-bit register mentioned above. This solution is more efficient than other solutions that require a specific load channel for the bias. The use of the bias is defined (programmed), optionally, at runtime for each matrix operation. [0223] [0224] -Activation function. The vector product (or products), that is, the result of each MACC operation, can be filtered by a function (in the mathematical sense). This completes the neuronal processing of the hidden layer(s). The implementation of figure 7 includes a block 160 for generating one or more activation functions. The sub-modules 160a, 160b, 160c refer to different instances (copies) of the same block, not to different activation functions. Each instance 160a, 160b, 160c may include different functions as sub-modules. Non-limiting examples of these functions are the logistic sigmoid, the rectifier function, the maximum function, the hyperbolic tangent, radial basis functions, etc. These functions are implemented by means of a circuit based on Centered Recursive Interpolation (CRI), which allows the selection of different types of functions and significantly optimizes the logic resource requirements. Specifically, any additional multiplication is avoided, limiting the complexity of the circuit to a comparator, a subtracter and hard-wired shifts. Additionally, when a symmetric function is computed, the modulus (absolute value) of the input is operated on, and the result is transformed in the last step if the input sign is negative. The component is completely segmented (pipelined), so it allows a throughput of one data item per clock cycle, and the total latency of the circuit is at most 4 cycles. In figure 7, each configurable and programmable CRI circuit 160a, 160b, 160c has been referred to as sigmoid_wp.
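The CRI formulation itself is not reproduced here. Purely as an illustrative stand-in for the same style of multiplier-less activation hardware (symmetry exploited through the absolute value and a final sign correction, slopes restricted to powers of two so that they reduce to shifts and adds), the following sketch implements the well-known PLAN piecewise-linear approximation of the logistic sigmoid; it is not the CRI circuit used in the design.

def sigmoid_plan(x):
    # symmetry: work on |x| and correct the sign at the end
    ax = abs(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = ax / 32 + 0.84375    # slope 2^-5
    elif ax >= 1.0:
        y = ax / 8 + 0.625       # slope 2^-3
    else:
        y = ax / 4 + 0.5         # slope 2^-2
    return y if x >= 0 else 1.0 - y

for x in (-6.0, -1.5, 0.0, 0.5, 3.0):
    print(x, round(sigmoid_plan(x), 4))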
In fact, a CRI circuit can generate different activation functions, such as those mentioned above by way of example. Before the synthesis of the circuit, the number of different functions to be implemented can be configured and, once the IP core is implemented, one of the implemented options can be chosen by changing a single selector (a signal of length log2(#functions), that is, the base-2 logarithm of the number of functions). [0225] [0226] -Feedback. The output of block 180 can be fed back to perform two consecutive matrix-vector operations (in the MACCs block 150), with optional intermediate filtering (in the activation function block 160); that is, with a single instance of maccs both the hidden layer and the output layer of the network are processed (complete processing of a vector in a "Single Layer Feedforward Network" or SLFN). It is also possible to process the data received through the input interface 101 together with the product of the MACC operations of the first operation filtered by the activation function, computing the second matrix-vector operation as in an RVFL. An RVFL is a type of network similar to an SLFN, but in which some connections of the input nodes are made directly to output nodes, without going through neurons of the hidden layer. This feedback functionality is managed from the control module 180, illustrated in figure 7, from which the design model of the IP core 100 is defined. The synthesis of the feedback line, as well as the logic associated with its management, has a cost in logic resources that can be saved if only regular networks are going to be processed, in which there are no interconnection jumps between layers (unlike, for example, in RVFLs). Therefore, the IP core can be implemented with or without feedback. If this feedback is included in the synthesis, its use is optional at runtime (it can be used or not). In other words, at synthesis time it is defined which network models can be executed and, at execution time, a particular network model is chosen by writing to a single register. In this way, data locality is exploited, avoiding the passage through the main memory (which is not part of the present disclosure). [0227] [0228] -Multilayer networks. To implement multilayer network models, the main component (IP core 1, 100, 100') can be reused (instantiating a single copy and performing multiple executions) or instantiated multiple times (as many times as layers are to be implemented). Throughout the present text the expression "physical layers" is used to refer to the number of instances implemented in hardware, and the expression "model layers" to the number of layers of the ANN model to be computed. [0229] [0230] -Fixed point. All the operations are carried out in fixed point of adjustable precision, which means that the word size used for each of the matrices involved in each execution can be defined. The declaration of constants at synthesis time has been automated, so that the optimal sizes for each circuit component are calculated from the definition of the problem. This optimization makes it possible to reduce, on the one hand, the number of memory elements required and, on the other hand, the complexity of the arithmetic operators used. In the particular case of an FPGA implementation, the latter is only reflected in the activation functions, since the DSP modules are hard resources, that is, they cannot be modified.
However, this is relevant for the design of ASICs, since in that case only arithmetic operators of the required precision are implemented. In embodiments of the invention, fixed word sizes are used throughout the circuit, so computation is always performed with the same precision; this is the most area-efficient solution. In other embodiments of the invention, two sets of sizes are used, so that two operations with different precision can be computed. This option is specially designed to compute an SLFN/RVFL in two steps with a single instance of the main component; the increase in area is minimal. In other embodiments of the invention, a "barrel shifter" is included at the output of each DSP, so that the precision is adjustable at execution time. In this case, the control module 185 allows the precision to be changed by writing a single register. However, the "barrel shifter" can be an expensive component, depending on the word size used and the desired shifting possibilities. In short, compared to floating-point solutions, fixed point offers greater precision using the same word size, as long as the range of the signals can still be represented. Therefore, to minimize the risk of incorrect results, all operations are preferably carried out in "full precision" and the results are preferably saturated before truncating or rounding. [0231] [0232] The operation of all circuit components (IP core 100) is derived from the maccs 150 block. Since the maccs 150 block has a defined execution pattern, most modules (all except the control module 180) function as a "stream" (note that, in a stream, most components have no notion of the function they perform in the system and have no explicit synchronization except with the modules immediately before and after them; in addition, this synchronization is basic in complexity, so the use of one- or two-bit 'flags' is prioritized), and no information related to the memory position of the data is managed, thus optimizing the logic resources necessary for its management. The orchestration of operations guarantees that the output results are obtained in a certain order. The following describes how information is managed in each module and what ranges are used. Specifically, the following are explained: the connections to the MAI 200, the data entry, the data output, the clock domains, and the programmability and control. [0233] [0234] -Connections to the interconnection core 200 (MAI). The connection ports to the MAI 200 shown in figures 2-3 are read-only (port 101 for connection to the swi block 11) or write-only (port 102 for connection to the swo block 17). Depending on the version of the MAI 200 module, this can mean an optimization in the implementation. On the other hand, there is a third read/write port (not illustrated in figures 7-8) between the MAI module 200 and the control block 185 of the IP core 100, which is described below. With regard to transfers between the swi 120 block and the swo 190 block, there are mainly two implementation possibilities: memory-mapped implementation and implementation in "stream" mode, since the data is used by rows or columns of a matrix and the MAI 200 module allows block transfers using either arrangement. On the other hand, the interfaces with the MAI 200 module use virtual addresses that represent matrix, column and row pointers. The MAI 200 module is responsible for translating that representation into the physical address of the memory blocks external to the IP core 100. [0235] [0236] -Data entry.
One of the main aspects to be exploited to accelerate the computation of algorithms with intensive memory-access requirements is the exploitation of data locality. It is known that a thorough study of both temporal and spatial locality makes it possible to avoid passing through main memory (external to the IP core 1, 100, 100') by using smaller storage blocks close to the arithmetic modules. In embodiments of the invention, N DSP arithmetic modules (#DSP) DSP0-DSP9 are used to perform h vector multiplications, where h is the number of rows of the matrix in the matrix-vector product. Note that h depends on the layer of the model to be computed. Specifically, h is used to calculate the number of folds needed. Therefore, if h > #DSP (if h > N), more than one pass, iteration or repetition is required to process each input vector. This ability to repeat is called layer folding (folding or tiling), and refers to the number of repetitions required by a vector. Note that folding refers to the number of times that the maccs 150 block is reused, but it must be taken into account that the passes (waves) through the maccs 150 block are continuous (it can be seen as an overlap) until there is no more information to process. That is, if for example h = 12, #DSP = 3 and 5 consecutive vectors are processed, there will be ceil(12/3) * 5 = 20 waves of data circulating through the maccs 150 block. In the present text, the number of "passes" or "waves" per vector is ceil(h / #DSP), so folding = passes - 1 or, which is the same, the number of passes or waves = folding + 1. Thus, if h <= #DSP, a vector is processed in a single pass/wave, so there is no repetition (folding = 0). As has been said, to take advantage of data locality, BRAM blocks 120i, 120b1-120b9 are used in the form of FIFOs (fifos block 140 in figures 7 and 8). The first BRAM block 140i is used to buffer the input vectors. The remaining BRAM blocks 140b0-140b9 store the contents of the matrix: each of them stores one row of each group of N rows of the matrix, where N is the number of DSPs (#DSP). Figure 6 graphically shows the access patterns to the BRAM blocks used as FIFOs for the processing of three consecutive matrix-vector products in a problem with four passes (14(i) on the right and the rest, 14(b), on the left). In relation to data entry, the following explains the data entry to the first fifos block 120i, the data entry to the remaining fifos blocks 120b0-120b9, and the data entry to the swi block 11 (sub-block intercon_si 110 and sub-blocks f_ctrl 1110-11110). [0237] With respect to the first fifos block 140i, this block has a high spatial and temporal locality. So much so that the minimum recommended depth is of only a few vectors, offering very adequate scalability properties for problems with a large amount of data to be processed. When reflecting this behavior in the description of the module 1, 100, 100', the FIFO inference reference has been modified to protect each vector until its utility has expired. Note that the inference references are templates provided by the manufacturer of the target platform (i.e., the FPGA) and/or the developer of the synthesis tool, so that the logic resources described in HDL produce the desired result. When it is modified, 'hard' FIFOs cannot be used. However, most FPGAs implement these through BRAMs with additional logic optimized to behave as such. Therefore, the changes introduced are easily assimilated.
Specifically, a circular FIFO has two indices, read and write, which increase each time the corresponding operation is performed. Therefore, if four vectors were written from main memory before the processing of the first one had finished, there would be a risk of the first vector being overwritten. To prevent this, this design uses a safeguard register: the flag f takes as reference the first position of a vector and does not advance until its usefulness expires. The flag e is generated with the read pointer, which is incremented in the usual way. The modification made consists in resetting the read pointer to the registered value at the end of each "pass". In the last "pass" of each vector, however, the register takes the value of the pointer. The fact of intervening on the pointer prevents FPGA synthesis tools from taking advantage of FIFO inference templates. In embodiments of the invention, instead of a buffer, a scratchpad with a simplified interface is used to avoid direct management of any address. The scratchpad consumer (that is, any module that reads the scratchpad information) has only three control signals: RST, a signal to reset all pointers to zero; FB, a signal to reset the read pointer to the registered value; and FW, a signal to advance the register to the value of the read pointer. In the IP core of the present invention, the consumer module is the maccs module. [0238] [0239] As for the remaining fifos blocks 140b0-140b9, these blocks show a high spatial locality, but a less desirable behavior in terms of temporal locality. This feature is not desirable for large matrices, since each block must store as many full rows as the problem requires. However, this drawback is compensated by the fact that the implementation is preferably a pure FIFO, that is, a classic buffer without address information. Since the reading pattern is linear, if not all the content can fit, the communication controller (f_ctrl (swi)) can and should repeat the reading sequence for each vector. The design automatically detects this situation from the choice of configuration parameters. [0240] With respect to the block swi 120 (sub-block intercon_si 320 and sub-blocks f_ctrl 3301-3311), the f_ctrl instance of each sub-block 3301-3311 has two chained counters corresponding to the number of vectors to be transferred and their length. In the case of sub-block 3301, corresponding to the first fifos block 140i, the number of vectors equals the number of consecutive matrix-vector products to be computed. In the rest of the sub-blocks 3302-3311, the number of vectors is the number of passes that must be made with the corresponding DSP. This is possible because the present design uses the relative position of the fifos blocks 140b1-140b9 (and therefore of the DSPs, and by extension of the saturation modules) as a partial coding of the addresses. The component in charge of decoding the local address based on the position and the given indices is the arbiter in the sub-block intercon_si 320 of block 120. Figure 9 shows a schematic of the main component (module si_adr) used to obtain the row pointer pROW and the column pointer pCOL in the sub-block intercon_si 320, where i indicates whether the transfer corresponds to the memory 'i' or to any other, id is the position in the active memory array (the one for which the address is being decoded), and x and y are the values of the counters given by f_ctrl.
In figure 9, the black blocks are registers, the white block is a multiplexer and the "AND" block is a logic gate; the solid lines represent the datapath and the dotted line represents a selection signal. [0241] [0242] Returning to figure 7, the operation of an f_ctrl sub-block 3301-3311 is as follows. First, the control block 185 writes the limits of the counters into two registers of f_ctrl. When the corresponding f_ctrl sub-block detects the change, it sets STALL to 0 to indicate that it wants to receive information. Note that STALL is a signal defined in the Wishbone standard; CYC, GNT and STB, referred to below, are also signals defined in said standard, and INC is a user signal in accordance with it. When the arbiter selects one of the sub-blocks 3301-3311 that is requesting a transfer, preferably applying the round-robin scheduling algorithm (that is, in order from the first to the last and back to the start), it sets CYC to 1. The arbiter keeps INC at 1 for two cycles, which increments the counters of the f_ctrl sub-block 3301-3311 and allows si_adr (figure 9) to be filled. The matrix pointer is obtained from the control module 185, since it is the same for all instances of the f_ctrl sub-block except the one for the first fifo block 140i. The arbiter uses the pointers to request the initiation of a transfer to the MAI 200. When GNT is received, with each STB the tag (x and y) of the request made (typical of the Wishbone standard) is saved in a small FIFO, so as to advance through the pipeline at the same time as requests are made. What is specifically requested is the content of the matrices to be multiplied, to be stored in the fifos submodules. Specifically, one matrix is stored divided among the 'b' memories (modules 140b0-b9), and the other is loaded only into 'i' (module 140i). Each ACK received from the MAI 200 is transmitted directly to the f_ctrl sub-block together with the corresponding tag read from the corresponding FIFO. The transfer ends if: the f_ctrl sub-block reaches the end of its task and sets STALL to 1; or the STALL coming from the MAI 200 (in turn from the main memory) is set to 1; or an internal transfer-length counter of the arbiter reaches its limit. When the transfer is finished, the arbiter sets CYC to 0, and the f_ctrl sub-block updates the counters with the values immediately after the last received tag.
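As a hedged, software-level reading of the addressing scheme just described (two chained counters per f_ctrl instance plus the positional coding resolved by si_adr), the following sketch derives the counter limits of the #DSP + 1 controllers and the row/column pointers they imply. The mapping of each 'b' memory to the rows dsp_id, dsp_id + #DSP, dsp_id + 2*#DSP, ... is one plausible interpretation of the positional coding, not a literal transcription of the circuit.

import math

def f_ctrl_limits(n_dsp, n_input_vectors, h, z):
    # Counter limits (vectors to transfer, vector length) of the #DSP + 1
    # controllers: the FIFO 'i' controller counts consecutive matrix-vector
    # products, the FIFO 'b' controllers count the passes of the folded layer.
    passes = math.ceil(h / n_dsp)                # folding = passes - 1
    return [(n_input_vectors, z)] + [(passes, z)] * n_dsp

def si_adr(is_input_fifo, dsp_id, x, y, n_dsp):
    # Recover (pROW, pCOL) from the local counters x, y of one f_ctrl plus the
    # fixed position dsp_id of its FIFO (the positional part of the address).
    if is_input_fifo:
        return x, y                              # x: vector index, y: element index
    return dsp_id + x * n_dsp, y                 # 'b' memory dsp_id holds rows id, id+N, ...

# Example: a 12-row weight matrix folded onto 10 DSPs, vectors of length z = 8
print(f_ctrl_limits(n_dsp=10, n_input_vectors=5, h=12, z=8))
print([si_adr(False, 2, p, 0, 10)[0] for p in range(2)])   # rows fetched for DSP 2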
[0245] [0246] At a high level, the operation of the swi 120 and control 185 blocks with the MAI 200 can be regarded as a DMA (Direct Memory Access) engine that manages #DSP + 1 tasks concurrently (i.e., it manages a number of tasks equal to the number of DSPs plus one). What differentiates it is the programming: thanks to an exhaustive analysis of the dimensional dependencies, in the implementation of the present invention the user defines the movements in terms of complete matrices, and the intervening registers and counters are adjusted automatically. In a common DMA, the generation of a sequence of instructions is required, because the transfer of each vector must be indicated. For example, T. Chen et al. ("DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning", SIGPLAN Not., vol. 49, no. 4, pp. 269-284, Feb. 2014) use a FIFO attached to each DMA to preload the instructions so that continuous intervention is not required. In the present invention, thanks to the use of counters and positional coding, no such memories are needed, since the sequence is implicit. [0247] [0248] In addition to the above, although in figures 7-8 a single port 101 is represented between the block intercon_si 320 and the MAI 200, the design allows the use of several: either two, one for sub-block 3301 and another for the rest 3302-3311; or one for sub-block 3301 and k ports for the rest, where k < #DSP. In that case the port becomes a crossbar. [0249] [0250] -Data output (from the mmg sub-blocks (also called MUXa, MUXb, MUXc) 13a_out, 13b_out and 13c_out; from the activation functions; from the smerger_wp 170 block; and from the swo 190 block): [0251] [0252] In the mmg subgroups (or MUXa, MUXb, MUXc), since the data is output in a staggered (ladder) fashion (outputs 13a_out, 13b_out and 13c_out), a single counter is used whose maximum modulus is equal to the number of lines it manages. Starting from a pulse indicating the availability of the first data item, the data is read sequentially and transmitted serially to the activation module 160. The activation module 160 (sigmoids) is composed of multiple instances, which may include one or several activation functions (CRI circuits). When the last data item of each group is written, a pulse is issued to indicate to the next mmg subgroup (MUXb) the availability of its first element. In this way the control logic is optimized, since it is not necessary for the main state machine to supervise the sequencing between mmg instances. [0253] [0254] As for the activation functions (block 160), on the one hand, the number of activation functions is automatically calculated at run time according to the number of output ports 13a_out, 13b_out, 13c_out resulting from evaluating the global parameters. In embodiments of the invention, an activation function is implemented at the output of each accumulator 13a_out, 13b_out, 13c_out. In other embodiments of the invention, a single activation function is implemented for all the accumulators. On the other hand, the activation functions are segmented (pipelined) and passive, that is, the data enters and leaves with the same constant frequency and latency. The function can be chosen at run time but, once in use, no intervention is required. These characteristics mean that the filtering requires few logic resources and minimal calculation time.
However, in the particular case that the ANN is used as a classifier or predictor with several output nodes, the output of the classification layers requires a counter with a maximum modulus equal to that of the counter in the corresponding mmg sub-block (MUXa, MUXb, MUXc in figure 7). This is because, unlike the other functions, the maximum operator applied to a vector of elements is a compressor, so it needs to know how many elements the group to be compressed has. However, this is a parameter that is calculated automatically: partially at synthesis time and partially at execution time, from the definition of the problem given by the user. [0255] [0256] If the variant that implements the mmg sub-blocks as a shift register is used (see figure 8), the maximum-operator counter is necessary regardless of whether the operator is used or not. Likewise, the daisy chain of synchronization signals is transferred to the sub-blocks sigmoid_wp 160a, 160b, 160c. [0257] [0258] Regarding the block smerger_wp 170, in order to write the results into main memory in the required order, the sub-block smerger 171 must read the outputs of the FIFOs (which are buffers) 170a, 170b, 170c in groups of the same size as the groupings at the outputs 13a_out, 13b_out, 13c_out of the maccs 150 block. For this, it uses two counters, one with a maximum modulus equal to the group size and another with a maximum modulus equal to the number of rows of the matrix in the matrix-vector product. This is because the number of elements of the last group read depends on the programming at runtime, so an approach that optimizes the size of the second counter would require additional processing of some parameter. In the particular case that the ANN is used as a classifier, when the output layer of the classifier is processed, the use of the maximum operator is optional, since it may be useful to obtain the weight of all the classification labels. If it is used, it is only necessary to read one data item from each FIFO 170a, 170b, 170c, because they have been previously compressed in the sub-blocks sigmoid_wp 160a, 160b, 160c. At the same time, a maximum operator is used to compress the partial results into a single value, which is written in the memory 'o' 180. As in the block intercon_si 320, the design contemplates the implementation of the smerger sub-block 171 as a crossbar, so that there are several instances of the memory 'o' 180. [0259] [0260] Finally, as for block swo 190: like block swi 120, swo 190 uses a virtual address format composed of matrix, row and column pointers. However, unlike the input block 120, during the execution of a problem the matrix in swo 190 is fixed, since all the resulting vectors belong to the same set. Another notable difference is that there are no passes to decode, as the data arrives already signaled. Therefore, the implementation is resolved with two counters, row and column. The length of the column is set at runtime by defining the problem in the control block 185. In the particular case of a classification network, when the maximum operator is used, as already mentioned, the column counter is unnecessary and remains at a fixed value. If the smerger sub-block 171 is implemented as a crossbar, and there are thus several memories 'o' 180, the swo 190 block can be implemented as a multiplexer, or as another crossbar with several connections to the MAI 200.
[0261] [0262] Another relevant aspect of the IP block 1, 100, 100' of the invention concerns the clock domains. In embodiments of the invention, the architecture can use up to three different clock domains. The embodiments shown in Figures 2, 7 and 8 use two clock domains: one for logic and arithmetic, and another for communications and transfers to and from main memory. In embodiments in which the processing module 10' is used as a subsystem of a larger system, the IP block 1, 100, 100' preferably uses at least a third domain, in addition to the two clock domains already mentioned. Figure 10 shows these three clock domains: a first domain 71 for the maccs 150 and sigmoids 160 blocks; a second domain 72 for the swi 120, smerger_wp 170 and swo 190 blocks; and a third domain 73 for the common interconnection matrix (AXI Interconnect) shared with other peripherals. On some devices, such as Xilinx's Zynq, the frequency of this third clock domain is fixed. Figure 10 represents an example of a system architecture (SoC) in which the IP block of the invention can be used. This SoC example includes, in addition to the IP block 1, 100, 100' of the invention, other blocks (AXI Interconnect, CPU, DDR controller, DDR3, UART, PC). Note that in Figure 10, unlike Figure 7, the smerger 170 sub-block is a crossbar, so there are multiple instances of the FIFO o 190. Although such a relatively high number with respect to the number of DSPs (#DSP) is not optimal in a practical implementation, it is used in Figure 10 to illustrate that all domain crossings are made through dual-clock FIFOs, thereby reusing the same buffer or scratchpad. This simplifies the logic required to synchronize the domains. The only component in the design of the IP block 1, 100, 100' that uses both clock domains is the control block 185, since it must write to the internal configuration registers of all the modules at run time.
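The clock-domain partitioning just described can be summarised as a single design rule: every connection between blocks assigned to different domains must cross through a dual-clock FIFO. The sketch below merely checks that rule on an illustrative block list; the connections listed, and which of them carry dual-clock FIFOs, are assumptions made for the example, not an exhaustive netlist of the design.

```python
# Illustrative domain assignment following Figure 10 (71: arithmetic,
# 72: memory transfers, 73: system interconnect).
DOMAIN = {
    "maccs": 71, "sigmoids": 71,
    "swi": 72, "smerger_wp": 72, "swo": 72,
    "axi_interconnect": 73,
}

# (source, destination, crosses_via_dual_clock_fifo); edges are assumed.
CONNECTIONS = [
    ("swi", "maccs", True),
    ("sigmoids", "smerger_wp", True),
    ("smerger_wp", "swo", False),   # same domain: no CDC buffer required
    ("swo", "axi_interconnect", True),
]

def check_clock_domain_crossings(connections, domain):
    """Raise if any edge crosses clock domains without a dual-clock FIFO."""
    for src, dst, dual_clock_fifo in connections:
        if domain[src] != domain[dst] and not dual_clock_fifo:
            raise ValueError(f"unsynchronised crossing: {src} -> {dst}")

check_clock_domain_crossings(CONNECTIONS, DOMAIN)  # passes for this example
```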
[0263] As mentioned above, the IP block 1, 100, 100' described with reference to Figures 1-10 is specially designed to be integrated into the architecture of a system on chip (SoC) or of an FPGA. The IP block 1, 100, 100' is especially compatible with the characteristics of the AXI bus. By way of example, and without limitation, when the IP block 1, 100, 100' has been designed to be integrated into an FPGA, once it has been synthesized with a given configuration, the place and route has been performed (this is part of the process of implementing a design described in HDL: the EDA tools first synthesize the code, extracting a generic netlist, and each element or component and the interconnections between them must then be placed and routed in the final implementation technology, i.e. the chip) and the FPGA has been configured, problems of any size can be processed at run time, provided that the synthesis values (number of inputs and number of outputs) are not exceeded. For this purpose, a series of registers (v, f, h, c; see Figure 1) are written in a control sub-network within the MAI 200. In addition, the writing of several more registers is required (in embodiments of the invention, between seven and ten more), depending on the network model to be computed. Once the execution has started, it is completely unattended (autonomous), and any external processor can perform other operations until all the consecutive scheduled products have been completed. Likewise, in order to avoid including single-use modules, four parameters derived from the relationships between h and #DSP (the number of DSPs), on the one hand, and between c and #DSP, on the other, are expected to be pre-computed. Next, the three matrix pointers used in each matrix-vector product must be written: the two inputs and the result. These pointers are virtual identifiers rather than absolute addresses: a few bits (typically between 2 and 7, depending on the number of matrices to be managed) that are used in the MAI to obtain the complete absolute address (typically 32 bits). Alternatively, if it is desired to dispense with the MAI, these identifiers can represent the offset of each matrix in a shared memory, by treating them as the most significant bits (MSBs) of the absolute address. If an SLFN model is executed, which performs two consecutive operations automatically, the writing of three further pointers is required. Finally, the processing begins with the writing of a value that represents the desired sequence.
[0264] [0265] The processing module of the present invention, as well as the SoC-based architecture that includes said processing module, have special application in the computation of neural networks.
[0266] In sum, the IP core of the present disclosure adapts automatically to the selected network model in two phases: at synthesis, through the "folding" of the layers to reuse the hardware resources in an optimized manner (with the consequent adaptation of the control and data flow), and at run time (once the processor is implemented) through a series of programmable configuration registers, which allow the size of the layer to be processed and the type of activation function to be adjusted by writing to those registers.
[0267] [0268] During synthesis, whereas proposals such as DianNao optimize each vector product using an adder tree and multiple multipliers in parallel, the design of the present disclosure performs each vector product sequentially (a MACC operation in a DSP) and exploits the parallelization possibilities in the outer loop of a matrix product. In other words, DianNao optimizes the computation of each neuron, whereas the present IP block optimizes the computation of several neurons in parallel. In addition, since DianNao does not compute multiple neurons in parallel, the folding concept is not considered there.
[0269] [0270] During execution, in contrast to conventional neural processors, which have a fixed structure and can therefore be configurable (at synthesis) but not programmable, the design of the present disclosure effectively processes a smaller network when some neurons are deactivated, so that computing time is reduced. Moreover, the necessary hardware resources are practically independent of the size of the layers, the number of DSPs being the main indicator. In other words, for the same number of DSPs, the design of the invention has similar requirements for layers with vectors of tens, hundreds or thousands of elements.
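To make the run-time programming model of [0263]-[0264] concrete, the host-side sketch below pre-computes fold-related parameters and composes an absolute address from a virtual matrix identifier in the MAI-less variant. The register names v, f, h and c are taken from the text, but their values, the choice of the four derived parameters and the bit widths are illustrative assumptions only.

```python
import math

NUM_DSP = 10    # number of DSPs fixed at synthesis (illustrative)
ID_BITS = 3     # width of the virtual matrix identifier (illustrative)
ADDR_BITS = 32  # width of the absolute address

def derived_parameters(h, c, dsp=NUM_DSP):
    # Plausible examples of host-side values derived from h/#DSP and c/#DSP
    # (the exact four parameters are not spelled out here).
    return {"h_passes": math.ceil(h / dsp), "h_rem": h % dsp,
            "c_passes": math.ceil(c / dsp), "c_rem": c % dsp}

def absolute_address(matrix_id, offset):
    # MAI-less variant: the identifier supplies the MSBs of the address.
    return (matrix_id << (ADDR_BITS - ID_BITS)) | offset

# Illustrative programming of one matrix-vector product.
regs = {"v": 1, "f": 0, "h": 25, "c": 16}             # placeholder values
regs.update(derived_parameters(regs["h"], regs["c"]))
pointers = {"input_a": 1, "input_b": 2, "result": 3}  # virtual identifiers
print(regs, hex(absolute_address(pointers["result"], 0x0400)))
```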
[0271] [0272] In short, the present invention provides a configurable and programmable processor (computer) of matrix products (in possible embodiments, large ones) in fixed point. The processor has particular characteristics (activation functions, etc.) that make it especially interesting for neural processing, although in its configuration phase (pre-synthesis) these characteristics can be removed so that it is used as a plain matrix multiplier. In this sense, the processor is an accelerator of linear transformations, with potentially wide applicability, to which non-linear transformations (activation functions) are added so that it operates as a neural network. The fact that it is programmable is relevant: the reuse of a single IP core has been prioritized, so it is highly configurable and scalable, portable and reusable for the computation of multiple problems (network models) of different sizes and configurations, instead of seeking the optimal configuration for a narrower range. This is also a consequence of how the VHDL code has been written. In addition, being a linear systolic array, it requires far fewer resources than a solution based on a rectangular systolic array. Nor is applicability to recurrent (non-feedforward) networks ruled out, in which some or all of the outputs of a layer are fed back to its inputs.
[0273] [0274] In this text, the word "comprises" and its variants (such as "comprising", etc.) should not be interpreted in an exclusive manner; that is, they do not exclude the possibility that what is described includes other elements, steps, etc. On the other hand, the invention is not limited to the specific embodiments described, but also covers, for example, the variants that may be produced by the person of average skill in the art (for example, as regards the choice of materials, dimensions, components, configuration, etc.), within what is clear from the claims.
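To close the description, the following Python sketch is a compact behavioural summary of the processing model of [0266]-[0270] and of the folding described in claim 13: each row of the matrix is computed as a sequential multiply-accumulate (one DSP), up to #DSP rows are handled per pass, and an optional activation function is applied to each result. It is a numerical reference only: plain floating point is used instead of the fixed-point arithmetic of the hardware, and nothing here reflects the VHDL structure.

```python
import math

def reference_layer(W, x, dsp, activation=None):
    """Folded matrix-vector product: rows of W are processed in groups of at
    most 'dsp' (one row per DSP, several passes when len(W) > dsp); each row
    is a sequential multiply-accumulate over the input vector x."""
    outputs = []
    for p in range(math.ceil(len(W) / dsp)):   # folding: passes over the rows
        for row in W[p * dsp:(p + 1) * dsp]:   # rows handled "in parallel"
            acc = 0.0
            for w, xi in zip(row, x):          # sequential MACC in one DSP
                acc += w * xi
            outputs.append(activation(acc) if activation else acc)
    return outputs

# Example: a 5x3 layer folded onto 2 "DSPs" (3 passes), sigmoid activation.
W = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
x = [1.0, -2.0, 0.5]
sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))
print(reference_layer(W, x, dsp=2, activation=sigmoid))
```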
Claims:
Claims (18)

1.- A configurable and programmable IP core (1, 100, 100') for the computation of a plurality of matrix products, in which both the data to be processed and the results obtained are transferred serially, comprising:
a data input block (12, 120) configured to provide, from input data, a set of vectors representing a first and a second matrix whose product is to be computed, using a virtual address format composed of matrix, row and column pointers, wherein said data input block (120) comprises:
a first sub-block (32, 320) configured to obtain a row pointer (pROW) and a column pointer (pCOL); and
a second sub-block (332; 3301-3311) comprising N components (3301-3311), where N is a natural number > 1, each of which comprises two chained counters corresponding to the number of vectors to be transferred and the length of said vectors, where each component (3301-3311) uses local addresses;
a memory block (14, 140) comprising N memory elements (141-145; 140i, 140b0 ... 140b9), each of said memory elements being associated with a respective output of said second sub-block (32; 3301-3311) of the data input block (120);
a fixed-point matrix-vector multiplier block (15; 150) configured to implement a multiply-accumulate operation in order to multiply a matrix by multiple vectors received serially and continuously, wherein said fixed-point matrix-vector multiplier block (15; 150) comprises a set of sub-blocks (150a, 150b, 150c), each of which comprises a plurality of arithmetic modules (DSP0-DSP3; DSP4-DSP7; DSP8-DSP9);
a block (16, 160) comprising at least one activation function configured to be applied to the output of said fixed-point matrix-vector multiplier block (15, 150);
a block (17; 170, 180) for storing in storage components (170a, 170b, 170c) the outputs of the at least one activation function and for reading (171) the outputs of said storage components (170a, 170b, 170c); and
a data output block (18, 190) that uses a virtual address format composed of matrix, row and column pointers, comprising a row counter and a column counter.

2.- The IP core (1, 100, 100') of claim 1, wherein the first component (3301) of said second sub-block is configured to provide a number of vectors equal to the number of consecutive matrix-vector products to be computed.

3.- The IP core (1, 100, 100') of any of the preceding claims, wherein the second to last components (3302-3311) of said second sub-block are configured to provide a number of vectors equal to the number of passes to be performed with the corresponding DSP.

4.- The IP core (1, 100, 100') of any of the preceding claims, wherein said fixed-point matrix-vector multiplier block (150) is based on a linear systolic array with parallel parameter loading and wave-type execution.

5.-

6.- The IP core (1, 100, 100') of any of the preceding claims, wherein each sub-block (150a, 150b, 150c) of said fixed-point matrix-vector multiplier block (150) comprises a multiplexer (MUXa, MUXb, MUXc) at its output.

7.- The IP core (1, 100, 100') of any of claims 1 to 5, wherein each sub-block (150a, 150b, 150c) of said fixed-point matrix-vector multiplier block (150) comprises at its output as many shift registers (R'0 ... R'9) as there are arithmetic modules in each sub-block.

8.-
9.- The IP core (1, 100, 100') of any of the preceding claims, wherein the parallel execution of said arithmetic modules (DSP0-DSP3; DSP4-DSP7; DSP8-DSP9) is controlled by a state machine that takes as reference only the first arithmetic module (DSP0).

10.- The IP core (1, 100, 100') of claim 9, wherein said state machine uses three support counters: length of the vector, number of repetitions and latency of the arithmetic module.

11.- The IP core (1, 100, 100') of any of the preceding claims, wherein said fixed-point matrix-vector multiplier block (150) represents at least one hidden layer of an artificial neural network.

12.- The IP core (1, 100, 100') of any of the preceding claims, comprising means for feeding back the output of said FIFO block (180) in order to perform at least two consecutive matrix-vector operations with intermediate filtering, so that with a single instance of the fixed-point matrix-vector multiplier block (150) both the at least one hidden layer and the output layer of the neural network are processed.

13.- The IP core (1, 100, 100') of any of the preceding claims, wherein M arithmetic modules (DSP0-DSP9) are used to perform h vector multiplications, where h is the number of rows of the matrix in the matrix-vector product, so that if h > M, more than one pass, iteration or repetition is required to process each input vector.

14.- The IP core (1, 100, 100') of any of the preceding claims, further comprising an interconnection core (200) configured to facilitate the integration of the IP core (1, 100, 100') into a heterogeneous system with one or several coprocessor modules, said interconnection core (200) being configured to manage memory blocks internal and external to the IP core (1, 100, 100').

15.- The IP core (1, 100, 100') of any of the preceding claims, wherein all the operations are carried out in fixed point of adjustable precision, the core being configured to define the word size used in each of the matrices involved in each execution.

16.- A system on chip (SoC) comprising at least one IP core (1, 100, 100') according to any of the preceding claims.

17.- An FPGA comprising at least one IP core (1, 100, 100') according to any of claims 1-15.

18.- A method of designing an IP core (1, 100, 100') according to any of claims 1-15, suitable for a target technology, comprising:
generating a netlist comprising a parameterized description of the IP core (1, 100, 100') suitable for said target technology;
synthesizing a neural network to be implemented, adapting the IP core (1, 100, 100') to the resources available in said target technology, wherein said adaptation is carried out by means of a technique of folding and reusing neural layers;
once a neural network of a given size has been synthesized, selecting at run time the number of neurons to be used in each network layer.
Similar technologies:
Publication number / reference | Publication date | Title
CN106940815B | 2020-07-28 | Programmable convolutional neural network coprocessor IP core
US20190114499A1 | 2019-04-18 | Image preprocessing for generalized image processing
US10817587B2 | 2020-10-27 | Reconfigurable matrix multiplier system and method
US7873811B1 | 2011-01-18 | Polymorphous computing fabric
US20190147325A1 | 2019-05-16 | Neural Network Architecture Using Control Logic Determining Convolution Operation Sequence
CN111542826A | 2020-08-14 | Digital architecture supporting analog coprocessors
JP5419419B2 | 2014-02-19 | System
US10984500B1 | 2021-04-20 | Inline image preprocessing for convolution operations using a matrix multiplier on an integrated circuit
Bank-Tavakoli et al. | 2019 | Polar: A pipelined/overlapped FPGA-based LSTM accelerator
US10114795B2 | 2018-10-30 | Processor in non-volatile storage memory
Eldridge et al. | 2015 | Towards general-purpose neural network computing
JP5027515B2 | 2012-09-19 | Reconfigurable logic device for parallel computation of arbitrary algorithms
Streat et al. | 2016 | Non-volatile hierarchical temporal memory: Hardware for spatial pooling
US10515135B1 | 2019-12-24 | Data format suitable for fast massively parallel general matrix multiplication in a programmable IC
Fons et al. | 2011 | Run-time self-reconfigurable 2D convolver for adaptive image processing
US20210271630A1 | 2021-09-02 | Compiler Flow Logic for Reconfigurable Architectures
Kritikos et al. | 2012 | Redsharc: A programming model and on-chip network for multi-core systems on a programmable chip
ES2697693B2 | 2019-11-13 | IP NUCLEO, ARCHITECTURE UNDERSTANDING AN IP NUCLEUS AND DESIGN PROCEDURE OF AN IP NUCLEUS
Petrica et al. | 2020 | Memory-efficient dataflow inference for deep CNNs on FPGA
Diamantopoulos et al. | 2018 | A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
WO2021158861A1 | 2021-08-12 | Scalable array architecture for in-memory computing
Lian | 2016 | A framework for FPGA-based acceleration of neural network inference with limited numerical precision via high-level synthesis with streaming functionality
US10943039B1 | 2021-03-09 | Software-driven design optimization for fixed-point multiply-accumulate circuitry
Martinez-Corral et al. | 2017 | A fully configurable and scalable neural coprocessor IP for SoC implementations of machine learning applications
EP3859535A1 | 2021-08-04 | Streaming access memory device, system and method
Patent family:
Publication number | Publication date
ES2697693B2 | 2019-11-13
WO2019020856A1 | 2019-01-31
Cited documents:
Publication number | Filing date | Publication date | Applicant | Title
CA2215598A1 | 1996-11-12 | 1998-05-12 | Lucent Technologies Inc. | FPGA-based processor
US20140289445A1 | 2013-03-22 | 2014-09-25 | Antony Savich | Hardware accelerator system and method
US9978014B2 | 2013-12-18 | 2018-05-22 | Intel Corporation | Reconfigurable processing unit
Legal status:
2019-01-25 | BA2A | Patent application published | Ref document number: 2697693; Country of ref document: ES; Kind code of ref document: A1; Effective date: 20190125
2019-11-13 | FG2A | Definitive protection | Ref document number: 2697693; Country of ref document: ES; Kind code of ref document: B2; Effective date: 20191113
Priority:
Application number | Filing date | Title
ES201730963A (ES2697693B2) | 2017-07-24 | IP NUCLEO, ARCHITECTURE UNDERSTANDING AN IP NUCLEUS AND DESIGN PROCEDURE OF AN IP NUCLEUS
PCT/ES2018/070526 (WO2019020856A1) | 2018-07-23 | Ip core, architecture comprising an ip core and method of designing an ip core